by Marc Angelo Acebedo
In data_cleaning.ipynb, I isolated the cleaned data into the following files: features.csv (including the eps_fc and eps_act features), avgs.csv (including the average_type column), and firms.csv.
#import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import seaborn as sb
import statsmodels.api as sm
import math
import random
import calendar
from matplotlib import cm
#rmse calculation
from sklearn.metrics import r2_score, mean_squared_error
%matplotlib inline
plt.style.use('bmh')
#define data directories
PATH_CLEAN = './data/clean/'
PATH_CLEAN_AVGS = './data/clean/averages/'
#define visuals destination
PATH_UNIVARIATE = './visuals/univariate/'
PATH_BIVARIATE = './visuals/bivariate/'
PATH_MULTIVARIATE = './visuals/multivariate/'
Import features and all averages
features = pd.read_csv(PATH_CLEAN + 'features.csv', low_memory = False)
avgs = pd.read_csv(PATH_CLEAN_AVGS + 'avgs.csv')
#import firm_ids for foreign key references
firm_ids = pd.read_csv(PATH_CLEAN + 'firms.csv')
#look at 5 random entries
features.sample(5)
avgs.sample(5)
print('FEATURES rows, columns = {}'.format(features.shape), '\n',
'AVERAGES rows, columns = {}'.format(avgs.shape))
features.dtypes
features.date = pd.to_datetime(features.date)
features.term = pd.to_datetime(features.term).dt.to_period('Q')
#verify dtypes
features.dtypes
avgs.dtypes
features.sample(5)
Our features.csv dataset contains 167,660 entries with 5 features. Firm ID, feature, date, and term are all categorical variables, while the value field is numeric and continuous. Even though the date and term fields are recorded as DateTime and Period objects respectively, they are still discrete, categorical data, because there is a limit to the years that can be recorded (1999 - 2020) and there cannot be more than 4 quarters (Q1 - Q4).
Our avgs.csv dataset contains 52,015 entries with 5 features. Firm ID, average type, time period, and feature are all categorical variables, while the average field is numeric and continuous.
I'm interested in seeing the historical correlation of forecasted vs. actual EPS across all firms in the 2019 S&P Index.
As somebody with very little familiarity with the stock market, I decided to dedicate this thesis project to exploring a field I do not know well. I read the book "A Beginner's Guide to the Stock Market" by Matthew R. Kratter, which piqued my interest in stock market trading, particularly dividend stocks. I decided that if I were going to educate myself further on the stock market, then this thesis project to end my senior year at NCF would be a perfect opportunity to directly explore this new interest. Not only would I be educating myself on how the stock market works, but I would also be working directly with stock market data, which will help me build intuition for future stock market and finance-related projects.
These are the questions I'd like to pose in my exploratory stage:
Does average EPS prediction error show different trends when measured on a yearly, quarterly, or full-term basis?
I generate "dumb EPS forecasts" by calculating the rolling mean of actual EPS from the past 2 quarters. How do these dumb forecasts compare to the Bloomberg forecasts?
What differences and similarities emerge when analyzing the prediction error and percentage error of EPS forecasts?
Do the relationships between actual EPS and forecasted EPS remain statistically significant regardless of forecast type (raw data as well as yearly, quarterly, and twenty-year averages)?
How do EOD prices trend across all firms from 1999 - 2019?
For the broadest overview, I predict that over time, EPS forecasts will become increasingly optimistic for those firms that consistently have high actual EPS values. Vice versa, over time, EPS forecasts will become increasingly pessimistic for those firms that consistently have lower-than-expected EPS values. As for the other factors, I expect yearly values to show more consistency in pattern (since it is more intuitive to measure across 20 years) than quarterly values (since economic conditions are greatly diverse over a 20-year period, no matter the quarter).
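As a concrete sketch of the "dumb forecast" idea above, the two-quarter rolling mean can be computed with pandas (toy data; the values and index are illustrative, not from the real dataset):

```python
import pandas as pd

# toy quarterly actual EPS for a single firm (illustrative values)
eps_act = pd.Series([1.0, 1.2, 1.1, 1.3, 1.4],
                    index=pd.period_range('2018Q1', periods=5, freq='Q'))

# "dumb" forecast: mean of the two previous quarters, shifted so that
# the forecast for quarter t uses only quarters t-1 and t-2
dumb_fc = eps_act.rolling(2).mean().shift(1)
print(dumb_fc)
```

The shift(1) keeps the forecast out-of-sample: without it, the rolling mean for a quarter would include that quarter's own actual EPS.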
sb.set(style = "darkgrid")
def generate_missing_total(df, title_name, save_path, csv_name):
plt.figure(figsize = [10, 5])
plt.title('Missing Values per Column under ' + title_name, size = 20)
na_counts = df.isna().sum().sort_values(ascending = True)
na_counts.plot.barh(x = na_counts.values, y = na_counts.index);
plt.xlabel('Count', size = 10)
plt.ylabel('Column Name', size = 10)
plt.savefig(save_path + csv_name)
features.isna().any()
generate_missing_total(features, '"Features"', PATH_UNIVARIATE, 'features-missing-total.png')
Observation 1: date is the field with the highest amount of missing data, with around 85,000 missing entries.
Observation 2: value contains around 26,000 missing entries.
Observation 3: The gap in number of missing values between value and date is noticeably large.
Questions
1) For all the entries where date is not NaT, how large is the gap in missing values between values without the dropped dates and values with the dropped dates?
features_date_dropna = features[features['date'].notna()]
plt.figure(figsize = [10, 5])
plt.suptitle('Missing Values per Column under "Features"', size = 20)
plt.title('after dropping empty dates', size = 15)
na_counts = features_date_dropna.isna().sum().sort_values(ascending = True)
fig = na_counts.plot.barh(x = na_counts.values, y = na_counts.index);
plt.constrained_layout = True
plt.xlabel('Count', size = 10)
plt.ylabel('Column Name', size = 10)
plt.savefig(PATH_UNIVARIATE + 'features-missing-date-dropna.png')
Observation 1: Keeping NaT dates, the amount of overall missing values is around 26,000. After dropping all the NaT dates (i.e., dropping the eps_fc and eps_act entries, which are the only features with NaT dates), the number of missing values dropped to around 14,000.
Observation 2: Drawing from the previous observation, this means that eps_fc and eps_act both have around 12,000 missing values in total.
Observation 3: Effectively, the undropped columns, eod_act and eps_fc_terms, have around 14,000 missing values in total.
Questions:
2) How many missing values does each feature have, individually?
#turn feature values into index values
features_num_nan = features[['feature', 'value']].groupby('feature').count()
#count number of total values (NaN or not) per feature
feature_counts = features.feature.value_counts()
#create function to calculate number NaN
def calculate_nan(num_nan, counts, feature_name):
num_nan.loc[feature_name] = counts[feature_name] - num_nan.loc[feature_name]
#num NaN (feature) = num all values (NaN or not) - num all values (not NaN)
#create array of feature names
feature_names = ['eps_fc', 'eps_act', 'eod_act', 'eps_fc_terms']
#update all nan counts per average type
for feature_name in feature_names:
calculate_nan(features_num_nan, feature_counts, feature_name)
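As a cross-check on the subtraction approach above, the same per-feature NaN counts can be computed directly by grouping a null mask (toy frame shown here; the real notebook's columns of interest are feature and value):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'feature': ['eps_fc', 'eps_fc', 'eps_act', 'eps_act'],
    'value':   [1.0, np.nan, 2.0, 2.5],
})
# count NaNs per feature directly, instead of total minus non-null
nan_per_feature = toy['value'].isna().groupby(toy['feature']).sum()
print(nan_per_feature)
```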
plt.figure(figsize = [10, 5])
sb.set(style = 'whitegrid')
plt.title('Missing Values per Feature under "Features"', size = 20)
plt.xlabel('Feature')
plt.ylabel('Count')
sb.barplot(x = features_num_nan.index, y = features_num_nan.value, color = "pink")
plt.savefig(PATH_UNIVARIATE + 'features-missing-per-feature.png')
Observation 1: The feature with the most amount of missing values is eps_fc_terms.
Observation 2: The feature with the least amount of missing values is eps_act.
Observation 3: eod_act, eps_fc, and eps_fc_terms all have a very small gap in missing data compared to eps_act.
Observation 1: Below is a list of each feature and its corresponding estimated number of missing values:
eod_act: ~7,900
eps_act: ~5,500
eps_fc: ~7,050
eps_fc_terms: ~7,100
Observation 2: eps_fc and eps_act have around 12,000 missing values total, which is consistent with our previous observation.
Observation 3: eod_act and eps_fc_terms have around 14,000 missing values total, which is also consistent with our previous observation.
avgs.isna().any()
#averages
generate_missing_total(avgs, 'Averages', PATH_UNIVARIATE, 'avgs-missing-total.png')
Observation 1: The average column contains 6,000 missing data entries.
Observation 2: The time_period column contains exactly 2,000 missing data entries.
Observation 3: The gap in missing values between average and time_period is 4,000.
Observation 4: By default, all entries with average_type of twenty_year should have NaT fields for time_period.
Questions:
3) After dropping all NaT fields under time_period, how large will the gap in missing values be between values with the NaT dates and values after dropping the NaT dates?
4) How many missing average values does each average_type contain?
avgs_date_dropna = avgs[avgs['time_period'].notna()]
plt.figure(figsize = [10, 5])
plt.suptitle('Missing Values per Column under "Averages"', size = 20)
plt.title('after dropping empty time periods', size = 15)
na_counts = avgs_date_dropna.isna().sum().sort_values(ascending = True)
fig = na_counts.plot.barh(x = na_counts.values, y = na_counts.index);
plt.constrained_layout = True
plt.xlabel('Count', size = 10)
plt.ylabel('Column Name', size = 10)
plt.savefig(PATH_UNIVARIATE + 'avgs-missing-date-dropna.png')
Observation 1: After dropping all NaT dates under time_period, there are still 6,000 missing averages. This number is consistent with the number of missing averages before dropping NaT dates. Thus, the entries with a missing time period (aka all twenty_year average types) did not contain any missing averages.
To summarize the above, the number of missing average values remains the same whether or not you drop all entries with empty time periods, aka all twenty_year average types.
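That summary can be verified programmatically: every twenty_year row should have an empty time_period. A minimal check on a toy frame mirroring the avgs.csv columns (the real check would run on avgs itself):

```python
import pandas as pd

avgs_toy = pd.DataFrame({
    'average_type': ['yearly', 'quarterly', 'twenty_year'],
    'time_period':  ['2005', 'q1', None],
})
is_twenty = avgs_toy['average_type'] == 'twenty_year'
# all twenty_year rows should have NaT/NaN time periods
all_twenty_missing = avgs_toy.loc[is_twenty, 'time_period'].isna().all()
print(all_twenty_missing)
```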
#turn avg values into index values
avgs_num_nan = avgs[['feature', 'average']].groupby('feature').count()
#count number of total values (NaN or not) per feature
avgs_counts = avgs.feature.value_counts()
#update number of missing values for each average type
for feature_name in feature_names:
calculate_nan(avgs_num_nan, avgs_counts, feature_name)
avgs_num_nan
plt.figure(figsize = [10, 5])
# sb.set(style = 'whitegrid')
plt.title('Missing Values per Feature under "Averages"', size = 20)
plt.xlabel('Feature')
plt.ylabel('Count')
sb.barplot(x = avgs_num_nan.index, y = avgs_num_nan.average, color = "pink")
plt.savefig(PATH_UNIVARIATE + 'avgs-missing-per-feature.png')
Observation 1: eps_fc has the most missing values at around 1650.
Observation 2: eps_act has the least amount of missing values at around 1250.
Observation 3: Under "Averages", the gap in missing values between each average type is larger than the missing values under "Features".
Observation 1: Each feature has the following approximate number of missing averages:
eod_act: ~1,550
eps_act: ~1,250
eps_fc: ~1,650
eps_fc_terms: ~1,590
#turn avg values into index values
avgs_num_avg = avgs[['average_type', 'average']].groupby('average_type').count()
plt.figure(figsize = [10, 5])
# sb.set(style = 'whitegrid')
plt.title('Missing Values per Average Type under "Averages"', size = 20)
plt.xlabel('Average Type')
plt.ylabel('Count')
sb.barplot(x = avgs_num_avg.index, y = avgs_num_avg.average, color = "pink")
plt.savefig(PATH_UNIVARIATE + 'avgs-missing-per-avg-type.png')
Observation 1: The approximate counts per average type are:
quarterly: ~7,500
twenty_year: ~2,500
yearly: ~3,600
Observation 2: The number of missing values varies widely across average types.
Since 505 firms are too many to fit into one single visual, I decided to split them apart by focusing on the 20 most common firm IDs and the 20 rarest firm IDs.
def generate_pct_bar(df, cat_var, color):
cat_counts = df[cat_var].value_counts()
ax = sb.countplot(data = df, y = cat_var, order = cat_counts.index, palette = color)
n_points = df.shape[0]
locs, labels = plt.yticks()
for p in ax.patches:
percentage = '{:0.1f}%'.format(100 * p.get_width()/n_points)
x = p.get_x() + p.get_width() + 0.02
y = p.get_y() + p.get_height()/2
ax.annotate(percentage, (x, y), size = 20)
features_firm_id_top = features.firm_id.value_counts()[:20].index
features_firm_id_top_lim = features.loc[features.firm_id.isin(features_firm_id_top)]
#check there are only 20 unique values
features_firm_id_top_lim.firm_id.nunique()
plt.figure(figsize = [20, 15])
x = features_firm_id_top_lim
generate_pct_bar(x, 'firm_id', 'viridis_r')
# n, bins, patches = plt.hist(x, num_bins, facecolor = 'pink', alpha = 0.5)
plt.xlabel('Firm ID', size = 20)
plt.ylabel('Count', size = 20)
plt.title('Top 20 Most Common Firm IDs under "Features"', size = 25)
plt.savefig(PATH_UNIVARIATE + 'features-firm-id-count-top.png');
plt.show();
features_firm_id_bottom = features.firm_id.value_counts()[-20:].index
features_firm_id_bottom_lim = features.loc[features.firm_id.isin(features_firm_id_bottom)]
#check there are only 20 unique values
features_firm_id_bottom_lim.firm_id.nunique()
plt.figure(figsize = [20, 15])
x = features_firm_id_bottom_lim
generate_pct_bar(x, 'firm_id', 'viridis_r')
# n, bins, patches = plt.hist(x, num_bins, facecolor = 'pink', alpha = 0.5)
plt.xlabel('Firm ID', size = 20)
plt.ylabel('Count', size = 20)
plt.title('20 Least Common Firm IDs under "Features"', size = 25)
plt.savefig(PATH_UNIVARIATE + 'features-firm-id-count-bottom.png');
plt.show();
Observation 1: Both the 20 most common and least common Firm IDs all make up the same proportion of existing Firm IDs: 5.0% each. This means that under "Features", there is a consistent count among all Firm IDs at around 335.
#check count consistency among firm_ids
firm_counts = features.firm_id.value_counts()
np.unique(firm_counts.sort_values().values)
I discovered that all firm id counts are consistent across the entire features.csv dataset at 332 entries per firm id. There are no null firm ids.
import itertools
plt.figure(figsize = [13, 8])
#set palette
base_color = sb.color_palette()[5]
#countplot
plt.subplot(1, 2, 1)
sb.countplot(data = features, x = 'feature', order = features.feature.value_counts(ascending = True).index,
color = base_color)
frame = plt.gca()
#pie chart
plt.subplot(1, 2, 2)
sorted_counts = features['feature'].value_counts()
plt.pie(features.feature.value_counts(), startangle = 90, counterclock = False,
autopct='%1.2f%%', labels = features.feature.value_counts().index);
plt.axis('square');
#overall graphic
plt.suptitle('All Features by Count and Percentage under "Features"', size = 20)
plt.tight_layout()
plt.subplots_adjust(top = 0.9)
plt.savefig(PATH_UNIVARIATE + 'features-feature-pct-count.png')
Observation 1: eps_act, eps_fc, and eod_act all show consistent counts at around 42500 entries each (25.30% each).
Observation 2: eps_fc_terms is the only feature type to deviate from the others, having less entries at around 41,000 (24.10%).
Observation 3: It makes sense that eps_fc_terms contains missing data, because the year 1999 was not included while gathering this data, effectively removing 2,020 entries (505 firms x 4 quarters).
#double check that eps_fc and eps_act are the only features to have null Date entries
features[features.date.isna()].feature.unique()
Since we acknowledged that the date column is set to NaT for eps_fc and eps_act, we will write all our interpretations accordingly: the graphs below do not account for eps_fc and eps_act.
features_years = features.date.dt.year.dropna().astype(int)
features_months = features.date.dt.month.dropna().astype(int)
#years
plt.figure(figsize = [15, 5])
ax = sb.countplot(x = features_years, color = sb.color_palette()[0])
ax.set(xlabel = 'Year', ylabel = 'Count')
ax.set_title('Frequency of Years under "Features"', size = 20)
plt.rcParams['axes.labelsize'] = 15
plt.savefig(PATH_UNIVARIATE + 'features-date-years-count.png')
plt.show()
Observation 1: All years between 2000 - 2018 have a consistent count at around 4,000 entries for all firms.
Observation 2: The years 1999 and 2019 both fall short of the 4,000 counts seen for all other Date years, owing to missing eps_fc_terms entries.
Observation 3: The year 1999 has 2500 non-null entries.
Observation 4: The year 2019 has 3500 non-null entries.
#months
features_months = features_months.apply(lambda x: calendar.month_abbr[x])
months_order = ['Jan', 'Mar', 'Apr', 'Jun', 'Jul', 'Sep', 'Oct', 'Dec']
plt.figure(figsize = [10, 5])
ax = sb.countplot(data = features, x = features_months, color = sb.color_palette()[0], order = months_order)
ax.set(xlabel = 'Month', ylabel = 'Count')
ax.set_title('Frequency of Months under "Features"', size = 20)
sb.despine(offset = 10, trim = False)
plt.rcParams['axes.labelsize'] = 15
plt.savefig(PATH_UNIVARIATE + 'features-date-months-count.png')
plt.show();
Observation 1: September has the highest count, at around 15,000.
Observation 2: October has the lowest count, at around 9,000.
Observation 3: The trend in counts of months under Date fluctuates; it is neither linear nor exponential, and there appears to be a "peak" in counts every second recorded month from January.
As noted earlier, there are no missing values under the Term column. Therefore, all graphs below do account for eps_act and eps_fc.
#years
plt.figure(figsize = [20, 10])
ax = sb.countplot(data = features, x = features.term.dt.year, color = sb.color_palette()[0])
ax.set(xlabel = 'Year', ylabel = 'Count')
ax.set_title('Frequency of Term Years under "Features"', size = 20)
plt.rcParams['axes.labelsize'] = 15
plt.savefig(PATH_UNIVARIATE + 'features-term-years-count.png')
plt.show();
Observation 1: All years from 2000 to 2019 have consistent counts at 8,000 per year.
Observation 2: Unlike the years under Date, 1999 is the only year that does not follow the general trend: there are only 6,000 recorded entries for 1999, meaning around 2,000 entries are missing for that year.
#quarter
plt.figure(figsize = [10, 5])
ax = sb.countplot(data = features, x = features.term.dt.quarter, color = sb.color_palette()[0])
ax.set(xlabel = 'Quarter', ylabel = 'Count')
ax.set_title('Frequency of Term Quarters under "Features"', size = 20)
plt.rcParams['axes.labelsize'] = 15
plt.savefig(PATH_UNIVARIATE + 'features-term-quarters-count.png')
plt.show();
Observation 1: Counts are consistent across quarters under Term, unlike years.
Observation 2: This means that quarter is a more stable, reliable variable to examine under Term than year, which should be examined more closely with regard to which feature(s) contain a given year.
def generate_hist(df, x, bins, title, xlabel, ylabel, save_path, csv_name):
plt.figure(figsize = [14, 7])
plt.hist(data = df, x = x, bins = bins, color = 'palevioletred')
plt.title(title, size = 25)
plt.xlabel(xlabel)
plt.ylabel(ylabel)
plt.savefig(save_path + csv_name)
value_bins = np.arange(features.value.min(), features.value.max() + 100, 100)
generate_hist(features, 'value', value_bins, 'Distribution of All Values under Features',
'Value (eps_fc, eps_act, eod_act, OR eps_fc_terms)',
'Count', PATH_UNIVARIATE, 'features-value-hist.png')
Observation 1: The trend in value counts is heavily right-skewed.
Observation 2: The value range 0-100 contains the highest concentration of data, with over 120,000 entries.
Observation 3: The value range 0-300 contains the "bulk" of all the data, which means the surrounding x-axis values are all outliers.
Questions:
4) What do value counts under Features look like after removing all outliers outside the value range 0-300?
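The cells below trim outliers by eye, zooming into a fixed value range. A more systematic alternative would be the conventional 1.5 x IQR fence, sketched here on toy values (not the real value column):

```python
import pandas as pd

values = pd.Series([1, 2, 2, 3, 3, 3, 4, 5, 100])  # one obvious outlier
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
# keep only values within 1.5 IQRs of the quartiles
trimmed = values[(values >= q1 - 1.5 * iqr) & (values <= q3 + 1.5 * iqr)]
print(trimmed.tolist())
```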
value_bins = np.arange(-5, 200 + 10, 10)
value_hist = generate_hist(features, 'value', value_bins, 'Distribution of All Values under Features (without outliers)',
'Value (eps_fc, eps_act, eod_act, OR eps_fc_terms)',
'Count', PATH_UNIVARIATE, 'features-value-hist-zoom-1.png')
Observation 1: The graph now resembles a normal distribution with a strong right skew. This is unlike the graph with all the outliers, where the right skew is much more pronounced.
Observation 2: Instead of breaking the bins up by bin widths of 100, the bin width here is 10.
Observation 3: The value range 0-10 contains the bulk of all the data. This is the highest bar, with counts of around 12,000.
value_bins = np.arange(0, 10 + 0.05, 0.05)
value_hist = generate_hist(features, 'value', value_bins, 'Distribution of All Values under Features (range 0-10)',
'Value (eps_fc, eps_act, eod_act, OR eps_fc_terms)',
'Count', PATH_UNIVARIATE, 'features-value-hist-zoom-2.png')
Observation 1: The graph, even when zoomed in, still retains a strong right skew.
Observation 2: The "bulk" of the data is around the value range 0.02 to 0.05: the "peak" of the distribution.
Just to make sure this graph is skewed heavily to the right, let's create a kernel density curve:
def generate_distplot(data, bins):
fig = plt.figure(figsize = [14, 7])
ax = sb.distplot(data, bins = bins, color = 'hotpink')
ax.minorticks_on()
return fig, ax
value_bins = np.arange(features.value.min(), features.value.max() + 20, 20)
generate_distplot(features.value.dropna(), bins = value_bins)
plt.xlabel('Value')
plt.ylabel('Density')
plt.title('Distribution of all Values under "Features"', size = 25);
plt.savefig(PATH_UNIVARIATE + 'features-value-dist.png')
Observation: As shown by the kernel density curve, the distribution of values under Features is heavily right-skewed. This means that most values are clustered around the left-side of the distribution where the mean, median, and mode are all located.
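The skew can also be quantified rather than judged visually: pandas exposes a sample skewness through Series.skew(), where a positive value indicates a right skew. A toy illustration (not the real value column):

```python
import pandas as pd

right_skewed = pd.Series([1, 1, 1, 2, 2, 3, 10])  # long right tail
print(right_skewed.skew())  # positive for a right-skewed sample
```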
avgs_firm_id_top = avgs.firm_id.value_counts()[:20].index
avgs_firm_id_top_lim = avgs.loc[avgs.firm_id.isin(avgs_firm_id_top)]
#check there are only 20 unique values
avgs_firm_id_top_lim.firm_id.nunique()
plt.figure(figsize = [15, 15])
x = avgs_firm_id_top_lim
num_bins = 5
generate_pct_bar(x, 'firm_id', 'viridis_r')
# n, bins, patches = plt.hist(x, num_bins, facecolor = 'pink', alpha = 0.5)
plt.xlabel('Firm ID')
plt.ylabel('Count')
plt.title('Top 20 Most Common Firm IDs under "Averages"', size = 20)
plt.savefig(PATH_UNIVARIATE + 'avgs-firm-id-count-top.png');
plt.show();
avgs_firm_id_bottom = avgs.firm_id.value_counts()[-20:].index
avgs_firm_id_bottom_lim = avgs.loc[avgs.firm_id.isin(avgs_firm_id_bottom)]
#check there are only 20 unique values
avgs_firm_id_bottom_lim.firm_id.nunique()
plt.figure(figsize = [20, 15])
x = avgs_firm_id_bottom_lim
generate_pct_bar(x, 'firm_id', 'viridis_r')
# n, bins, patches = plt.hist(x, num_bins, facecolor = 'pink', alpha = 0.5)
plt.xlabel('Firm ID', size = 20)
plt.ylabel('Count', size = 20)
plt.title('20 Least Common Firm IDs under "Averages"', size = 25)
plt.savefig(PATH_UNIVARIATE + 'avgs-firm-id-count-bottom.png');
plt.show();
Observation 1: Both the 20 most common and least common Firm IDs all make up the same proportion of existing Firm IDs: 5.0% each. This means that under "Averages", there is a consistent count among all Firm IDs at around 105.
#check consistency of counts among firm_ids
firm_counts = avgs.firm_id.value_counts()
np.unique(firm_counts.sort_values().values)
I discovered that all firm id counts are consistent across the entire avgs.csv dataset at 105 entries per firm id. There are no null firm ids.
value_bins = np.arange(avgs.average.min(), avgs.average.max() + 100, 100)
generate_hist(avgs, 'average', value_bins, 'Distribution of All Averages',
'Average(twenty-year, quarterly, yearly)',
'Count', PATH_UNIVARIATE, 'avgs-avg-hist.png')
Observation 1: The trend in value counts is heavily right-skewed.
Observation 2: The value range 0-100 contains the highest concentration of data, with around 36,000 entries.
Observation 3: The value range 0-300 contains the "bulk" of all the data, which means the surrounding x-axis values are all outliers.
Observation 4: All values, whether under Features or Averages, share a common pattern: the "bulk" of the data falls within the same bin range.
Questions:
6) What do average counts under Averages look like after removing all outliers outside the value range 0-300?
value_bins = np.arange(-5, 200 + 10, 10)
generate_hist(avgs, 'average', value_bins, 'Distribution of All Averages (without Outliers)',
'Average(twenty-year, quarterly, yearly)',
'Count', PATH_UNIVARIATE, 'avgs-avg-hist-zoom-1.png')
Observation 1: The graph now resembles a normal distribution with a strong right skew. This is unlike the previous graph with all the outliers, where the right skew is much more pronounced.
Observation 2: Instead of breaking the bins up by bin widths of 100, the bin width here is 10.
Observation 3: The value range 0-10 contains the bulk of all the data. This is the highest bar, with counts of around 12,000.
Observation 4: Among the bins from 10 to 110 that contain the other 2,000 counts, the bin 30-40 contains the "bulk" of that data at around 2,000 counts.
value_bins = np.arange(0, 10 + 0.05, 0.05)
value_hist = generate_hist(avgs, 'average', value_bins, 'Distribution of All Averages (range 0-10)',
'Average',
'Count', PATH_UNIVARIATE, 'avgs-avgs-hist-zoom-2.png')
Observation 1: The graph, even when zoomed in, still retains a strong right skew.
Observation 2: The "bulk" of the data is around the value range 0.02 to 0.05: the "peak" of the distribution.
Just to make sure this graph is skewed heavily to the right, let's create a kernel density curve:
value_bins = np.arange(avgs.average.min(), avgs.average.max() + 20, 20)
generate_distplot(avgs.average.dropna(), bins = value_bins)
plt.xlabel('Average')
plt.ylabel('Density')
plt.title('Distribution of all Averages under "Averages"', size = 25);
plt.savefig(PATH_UNIVARIATE + 'avgs-average-dist.png')
Observation: As shown by the kernel density curve, the distribution of average under Averages is heavily right-skewed. This means that most averages are clustered around the left-side of the distribution where the mean, median, and mode are all located.
A KEY TAKEAWAY: There is a strong congruency between the distribution patterns of stock prices (values) under features.csv and average prices (averages) under Averages.
plt.figure(figsize = [10, 5])
cat_order = avgs.average_type.value_counts().index
sb.countplot(data = avgs, x = 'average_type', color = 'hotpink', order = cat_order)
plt.xlabel('Average Type')
plt.ylabel('Count')
plt.title('Average Type by Count', size = 20)
plt.savefig(PATH_UNIVARIATE + 'avgs-avgtype-count.png')
Observation 1: Under Averages, yearly data has the highest count while twenty-year data has the lowest. This makes sense, as there are more yearly averages than quarterly averages per firm, and more quarterly averages than twenty-year averages per firm.
Note: number of entries may vary for yearly averages because the year 1999 and 2019 are missing for some features.
#keep for analysis, delete when done w/ report
len(avgs.query('average_type == "yearly" and firm_id == 0'))
colors = random.choices(list(mcolors.CSS4_COLORS.values()), k = 3)
plt.figure(figsize = [10, 5])
cs=cm.Set1(np.arange(40)/40.)
plt.pie(avgs.average_type.value_counts(), startangle = 90, counterclock = False,
autopct='%1.2f%%', labels = avgs.average_type.value_counts().index, colors = colors);
plt.suptitle('Average Type by Percentage', size = 20)
plt.tight_layout()
plt.subplots_adjust(top = 0.9)
plt.axis('square');
plt.savefig(PATH_UNIVARIATE + 'avgs-avgtype-pie.png')
Observation 1: As expected, yearly average types make up the majority of all entries at 80.58%.
Observation 2: As expected, twenty-year average types make up the least portion of all entries at 3.88%.
Observation 3: These percentages are consistent with the previous counts.
plt.figure(figsize = [15, 15])
cat_order = avgs.time_period.value_counts().sort_index().index
sb.countplot(data = avgs, x = 'time_period', color = 'hotpink', order = cat_order)
plt.xlabel('Time Period (yearly or quarterly)')
plt.ylabel('Count')
plt.title('Yearly & Quarterly Time Periods by Count (Averages)', size = 20)
plt.savefig(PATH_UNIVARIATE + 'avgs-time-period-hist.png')
I decided to portray both yearly and quarterly Time Periods in the same countplot because q1-q4 and the years 2000-2019 show consistent counts.
Observation 1: All years from 2000-2019 and all quarters show consistent counts, at just over 2,000 entries each.
Observation 2: The year 1999 has the least amount of counts, at only around 1510. This is because the feature eps_fc_terms does not contain the year 1999.
Observation 3: The time period for twenty-year averages does not show because, by default, all Time Period entries associated with an average type of twenty-year are set to null.
plt.figure(figsize = [14, 7])
#set palette
base_color = sb.color_palette()[5]
#countplot
plt.subplot(1, 2, 1)
cat_order = avgs.feature.value_counts().sort_index(ascending = False).index
# generate_pct_bar(avgs, 'feature', 'Blues_d')
sb.countplot(data = avgs, x = avgs.feature, color = 'hotpink', order = cat_order)
#pie chart
plt.subplot(1, 2, 2)
sorted_counts = avgs['feature'].value_counts()
plt.pie(avgs.feature.value_counts(), startangle = 90, counterclock = False,
autopct='%1.2f%%', labels = avgs.feature.value_counts().index);
cat_order = avgs.feature.value_counts().sort_index(ascending = False).index
plt.axis('square');
#overall graphic
plt.suptitle('All Features by Count and Percentage under "Averages"', size = 20)
# plt.tight_layout()
plt.subplots_adjust(top = 0.9)
plt.savefig(PATH_UNIVARIATE + 'avgs-feature-hist-pie.png')
Observation 1: eps_act, eps_fc, and eod_act all show consistent counts at around 13,000 entries each (25.24% each).
Observation 2: eps_fc_terms is the only feature type to deviate from the others, having less entries at around 12,500 (24.27%).
Observation 3: It makes sense that eps_fc_terms contains missing data, because the year 1999 was not included while gathering this data. This effectively removes around 500 entries: not a substantial amount of data.
General rule: The closer the eps_act - eps_fc difference is to 0, the more accurate the forecast.
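Prediction error and percentage error can be defined side by side following this rule; a sketch on toy values, using the eps_act_value and eps_fc_value column names constructed later in this notebook:

```python
import pandas as pd

toy = pd.DataFrame({'eps_act_value': [2.0, 1.0, -0.5],
                    'eps_fc_value':  [1.8, 1.1, -0.4]})
# raw prediction error: actual minus forecast (0 = perfect forecast)
toy['pred_error'] = toy['eps_act_value'] - toy['eps_fc_value']
# percentage error: prediction error relative to the actual value
toy['pct_error'] = 100 * toy['pred_error'] / toy['eps_act_value']
print(toy[['pred_error', 'pct_error']])
```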
plt.figure(figsize = [10, 10])
sb.stripplot(x = 'feature', y = 'value', data = features, jitter = True)
plt.title('Stock Price Types and Their Values', size = 20)
plt.xlabel('Stock Price Type')
plt.ylabel('Value')
plt.xticks([0, 1, 2, 3], ['EOD', 'Forecasted EPS (3 Months Prior)', 'Forecasted EPS', 'Actual EPS'])
plt.savefig(PATH_BIVARIATE + 'features-feature-value.png')
plt.show();
Observation 1: EOD Price has the most variability among the other stock price types.
Observation 2: Forecasted EPS (3 months prior) and Forecasted EPS (current) are the only stock price types that appear uniform in distribution: their values are clustered around 0.
Observation 3: Actual EPS has the most outliers compared to the other stock price types, with 5 visible outliers. Since actual EPS has the most outliers, forecasters' predictions appear too uniform and stable relative to actual stock market outcomes.
features_eod = features[features.feature == 'eod_act'].copy()
#create 3-year (12-quarter) moving averages to "smooth" out later graphs
features_eod['ma_value'] = features_eod.value.rolling(12).mean()
plt.figure(figsize = [20, 10])
sb.pointplot(x = 'term', y = 'value', data = features_eod, errwidth = 0.5)
plt.xticks(rotation = 'vertical')
plt.title('Average EOD Price by Term Across All Firms', size = 20)
plt.xlabel('Term')
plt.ylabel('Average EOD Price')
plt.savefig(PATH_BIVARIATE + 'features-eod-term.png')
Observation 1: The graph depicts a positive linear trend from all quarters spanning 1999 - 2019.
Observation 2: The average EOD price troughs in 2009, Quarter 1. The downturn begins in 2007, Quarter 4, but the average EOD price recovers shortly after, just one quarter later in 2009, Quarter 2.
In layman's terms, this generally positive linear trend shows that all 505 firms in the S&P 2019 Index have been getting "richer" over the past 20 years.
plt.figure(figsize = [20, 10])
sb.pointplot(x = 'term', y = 'ma_value', data = features_eod, errwidth = 0.5)
plt.xticks(rotation = 'vertical')
plt.suptitle('Average EOD Price by Term Across All Firms', size = 20, y = .93)
plt.title('in 3-year Moving Averages from 1999 - 2019', size = 15)
plt.xlabel('Term')
plt.ylabel('Average EOD Price')
plt.savefig(PATH_BIVARIATE + 'features-eod-term-ma.png')
The EOD graph could be used to correlate EPS forecasts with how firms are getting "richer" over time.
#isolate eps_fc and eps_act in their own columns to get their difference
features_diffs = features[features.feature.isin(['eps_fc', 'eps_act'])][['firm_id', 'term', 'value', 'feature']]
#isolate eps_fc and eps_act
eps_fc_diffs = features_diffs[features_diffs.feature == 'eps_fc'].rename(columns = {'value' : 'eps_fc_value'})[['firm_id', 'term', 'eps_fc_value']]
eps_act_diffs = features_diffs[features_diffs.feature == 'eps_act'].rename(columns = {'value' : 'eps_act_value'})[['firm_id', 'term', 'eps_act_value']]
#merge, creating separate columns for eps_fc and eps_act values
act_fc_diffs = eps_fc_diffs.merge(eps_act_diffs, how = 'inner', on = ['firm_id', 'term'])
act_fc_diffs.sample(5)
#create 'difference' column
act_fc_diffs['difference'] = act_fc_diffs['eps_act_value'] - act_fc_diffs['eps_fc_value']
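The merge-then-subtract step can be sketched on toy frames (hypothetical values) to make the sign convention explicit:

```python
import pandas as pd

#toy frames standing in for the eps_fc / eps_act slices (hypothetical values)
eps_fc = pd.DataFrame({'firm_id': [1, 1, 2], 'term': ['1999Q1', '1999Q2', '1999Q1'],
                       'eps_fc_value': [0.50, 0.60, 1.00]})
eps_act = pd.DataFrame({'firm_id': [1, 1, 2], 'term': ['1999Q1', '1999Q2', '1999Q1'],
                        'eps_act_value': [0.55, 0.45, 1.20]})

#inner merge on the (firm_id, term) key, mirroring act_fc_diffs above
diffs = eps_fc.merge(eps_act, on = ['firm_id', 'term'], how = 'inner')

#sign convention: actual - forecast, so a positive difference means the
#forecast was too pessimistic, and a negative one too optimistic
diffs['difference'] = diffs['eps_act_value'] - diffs['eps_fc_value']
print(diffs)
```

Keeping the subtraction order fixed matters for every "optimistic vs. pessimistic" reading below.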
plt.figure(figsize = [20, 15])
sb.pointplot(data = act_fc_diffs, x = 'term', y = 'difference', color = 'purple', errwidth = .5)
plt.suptitle('Average Prediction Error of EPS Values by Term', size = 20, y = .93)
plt.title('across All Firms', size = 19)
plt.xticks(rotation = 'vertical')
plt.xlabel('Term (Year and Quarter)')
plt.ylabel('Average Prediction Error')
plt.savefig(PATH_BIVARIATE + 'features-act-fc-diffs-term.png')
plt.show();
Observation 1: Forecasters were most optimistic in 2008, Quarter 4.
Observation 2: Forecasters were most pessimistic in 2000, Quarter 4.
Observation 3: The trend indicates no linear relationship; it stays flat, with average prediction errors centered around 0.
plt.figure(figsize = [20, 15])
sb.pointplot(data = act_fc_diffs, x = act_fc_diffs.term.dt.year, y = 'difference', color = 'purple', errwidth = .5)
plt.suptitle('Average Prediction Error of EPS Values by Year', size = 20, y = .93)
plt.title('across All Firms', size = 19)
# plt.xticks(rotation = 'vertical')
plt.xlabel('Year')
plt.ylabel('Average Prediction Error')
plt.savefig(PATH_BIVARIATE + 'features-act-fc-diffs-year.png')
plt.show();
Observation 1: Forecasters were most pessimistic in the year 2000. From the error bar, the year 2000 shows one of the widest variances in average prediction errors, ranging from 0 to 0.5.
Observation 2: Forecasters were most optimistic in the year 2008. From the error bar, the year 2008 shows one of the widest variances in average prediction errors, ranging from -0.25 to -1.5.
Observation 3: The overall trend stays consistent with the by-term trends above: relatively stable, with no slope.
plt.figure(figsize = [20, 15])
sb.pointplot(data = act_fc_diffs, x = act_fc_diffs.term.dt.quarter, y = 'difference', color = 'purple', errwidth = .5)
plt.suptitle('Average Prediction Error of EPS Values by Quarter', size = 20, y = .93)
plt.title('across All Firms', size = 19)
# plt.xticks(rotation = 'vertical')
plt.xlabel('Quarter')
plt.ylabel('Average Prediction Error')
plt.savefig(PATH_BIVARIATE + 'features-act-fc-diffs-quarter.png')
plt.show();
Observation: The later the quarter, the more optimistic forecasters become in their predictions. I suspect this is because each passing quarter gives forecasters more familiarity with the current year's stock market dynamics.
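The by-year and by-quarter plots above lean on the .dt accessors of the quarterly Period dtype that features.term was converted to. A minimal sketch on hypothetical terms:

```python
import pandas as pd

#hypothetical terms in the same quarterly Period dtype used for features.term
terms = pd.Series(pd.PeriodIndex(['1999Q1', '2008Q4', '2019Q2'], freq = 'Q'))

#.dt.year and .dt.quarter pull the grouping keys used as the x-axis above
print(terms.dt.year.tolist())     #[1999, 2008, 2019]
print(terms.dt.quarter.tolist())  #[1, 4, 2]
```

Passing these accessors straight into sb.pointplot's x argument is what lets seaborn aggregate the difference column per year or per quarter.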
#3-year moving averages
act_fc_diffs['ma_difference'] = act_fc_diffs.difference.rolling(12).mean()
plt.figure(figsize = [20, 15])
sb.pointplot(data = act_fc_diffs, x = 'term', y = 'ma_difference', color = 'purple', errwidth = .5)
plt.suptitle('Average Prediction Error of EPS Values by Term', size = 20, y = .93)
plt.title('in 3-Year Moving Averages (Actual - Forecasted EPS)', size = 19)
plt.xticks(rotation = 'vertical')
plt.xlabel('Term (Year and Quarter)')
plt.ylabel('Average Prediction Error')
plt.savefig(PATH_BIVARIATE + 'features-act-fc-diffs-term-ma.png')
plt.show();
plt.figure(figsize = [20, 15])
sb.pointplot(data = act_fc_diffs, x = act_fc_diffs.term.dt.year, y = 'ma_difference', color = 'purple', errwidth = .5)
plt.suptitle('Average Prediction Error of EPS Values by Year', size = 20, y = .93)
plt.title('in 3-Year Moving Averages (Actual - Forecasted EPS)', size = 19)
plt.xlabel('Year')
plt.ylabel('Average Prediction Error')
plt.savefig(PATH_BIVARIATE + 'features-act-fc-diffs-year-ma.png')
plt.show();
plt.figure(figsize = [20, 15])
sb.pointplot(data = act_fc_diffs, x = act_fc_diffs.term.dt.quarter, y = 'ma_difference', color = 'purple', errwidth = .5)
plt.suptitle('Average Prediction Error of EPS Values by Quarter', size = 20, y = .93)
plt.title('in 3-Year Moving Averages (Actual - Forecasted EPS)', size = 19)
plt.xlabel('Quarter')
plt.ylabel('Average Prediction Error')
plt.savefig(PATH_BIVARIATE + 'features-act-fc-diffs-quarter-ma.png')
plt.show();
Observation 1: Compared to their non-moving-average counterparts, the above pointplots of 3-year moving averages smooth the periods of optimism and pessimism into a consistent, slopeless trend.
#array of all firm ids
ids = firm_ids.firm_id.values
#put eps_fc and eps_act averages into separate DFs
avgs_eps_fc = avgs[avgs['feature'] == 'eps_fc']
avgs_eps_act = avgs[avgs['feature'] == 'eps_act']
#isolate eps_fc and eps_act DFs by "twenty_year" average type
avgs_eps_fc_twenty = avgs_eps_fc[avgs_eps_fc['average_type'] == 'twenty_year']
avgs_eps_act_twenty = avgs_eps_act[avgs_eps_act['average_type'] == 'twenty_year']
#df of only twenty-year average types
avgs_twenty = pd.concat([avgs_eps_fc_twenty, avgs_eps_act_twenty])
#obtain differences in eps_fc and eps_act averages BY firm id
avg_eps_fc_twenty = avgs_twenty.query('feature == "eps_fc"').average.values
avg_eps_act_twenty = avgs_twenty.query('feature == "eps_act"').average.values
twenty_diff_data = list(zip(ids, avg_eps_fc_twenty, avg_eps_act_twenty))
twenty_diffs = pd.DataFrame(twenty_diff_data, columns = ['firm_id', 'average_eps_fc', 'average_eps_act'])
twenty_diffs['difference'] = twenty_diffs['average_eps_act'] - twenty_diffs['average_eps_fc']
twenty_diffs['average_type'] = 'twenty_year'
twenty_diffs.sample(5)
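The zip(ids, …) pairing above assumes the eps_fc and eps_act slices (and firms.csv) all list firms in the same order. A merge on firm_id removes that assumption; a minimal sketch on hypothetical values, where the second frame arrives in a different row order:

```python
import pandas as pd

#toy twenty-year averages (hypothetical values); note the act frame's rows
#arrive in a different order, which would silently misalign a zip() pairing
fc = pd.DataFrame({'firm_id': [1, 2, 3], 'average_eps_fc': [0.5, 1.0, 1.5]})
act = pd.DataFrame({'firm_id': [3, 1, 2], 'average_eps_act': [1.4, 0.6, 0.9]})

#merging on firm_id pairs each firm with its own averages regardless of order
safe = fc.merge(act, on = 'firm_id')
safe['difference'] = safe['average_eps_act'] - safe['average_eps_fc']
print(safe)
```

The zip approach works here only as long as every upstream frame preserves the firms.csv ordering.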
plt.figure(figsize = [20, 20])
plt.title('Prediction Error among Twenty-Year Averages by Firm ID\n(Actual EPS - Forecasted EPS)', size = 25)
plt.plot(twenty_diffs['firm_id'], twenty_diffs['difference'])
plt.xlabel('Firm ID', size = 15)
plt.ylabel('Prediction Error', size = 15)
plt.savefig(PATH_BIVARIATE + 'avgs-diff-twenty-year.png')
plt.show();
#isolate eps_fc and eps_act DFs by "quarterly" average type
avgs_eps_fc_quarter = avgs_eps_fc[avgs_eps_fc['average_type'] == 'quarterly']
avgs_eps_act_quarter = avgs_eps_act[avgs_eps_act['average_type'] == 'quarterly']
#df of only quarterly averages
avgs_quarter = pd.concat([avgs_eps_fc_quarter, avgs_eps_act_quarter])
#get the average average separately per firm
avgs_quarter_sep = avgs_quarter.groupby(['firm_id', 'feature']).mean().reset_index()
avgs_quarter_sep.shape[0]
#obtain differences in eps_fc and eps_act averages BY firm id
avg_eps_fc_quarter = avgs_quarter_sep.query('feature == "eps_fc"').average.values
avg_eps_act_quarter = avgs_quarter_sep.query('feature == "eps_act"').average.values
quarter_diff_data = list(zip(ids, avg_eps_fc_quarter, avg_eps_act_quarter))
quarter_diffs = pd.DataFrame(quarter_diff_data, columns = ['firm_id', 'average_eps_fc', 'average_eps_act'])
quarter_diffs['difference'] = quarter_diffs['average_eps_act'] - quarter_diffs['average_eps_fc']
quarter_diffs['average_type'] = 'quarter'
quarter_diffs.head()
plt.figure(figsize = [20, 15])
plt.xlabel('Firm ID', size = 15)
plt.ylabel('Prediction Error', size = 15)
plt.title('Prediction Error among Quarterly Averages by Firm ID\n(Actual EPS - Forecasted EPS)', size = 25)
plt.plot(quarter_diffs['firm_id'], quarter_diffs['difference'])
# plt.scatter(twenty_diffs['firm_id'], twenty_diffs['difference'])
plt.savefig(PATH_BIVARIATE + 'avgs-diff-quarter.png')
plt.show();
#isolate eps_fc and eps_act DFs by "yearly" average type
avgs_eps_fc_year = avgs_eps_fc[avgs_eps_fc['average_type'] == 'yearly']
avgs_eps_act_year = avgs_eps_act[avgs_eps_act['average_type'] == 'yearly']
#df of only yearly averages
avgs_year = pd.concat([avgs_eps_fc_year, avgs_eps_act_year])
#get the average average separately per firm
avgs_year_sep = avgs_year.groupby(['firm_id', 'feature']).mean().reset_index()
avgs_year_sep.sample(5)
#obtain differences in eps_fc and eps_act averages BY firm id
avg_eps_fc_year = avgs_year_sep.query('feature == "eps_fc"').average.values
avg_eps_act_year = avgs_year_sep.query('feature == "eps_act"').average.values
year_diff_data = list(zip(ids, avg_eps_fc_year, avg_eps_act_year))
year_diffs = pd.DataFrame(year_diff_data, columns = ['firm_id', 'average_eps_fc', 'average_eps_act'])
year_diffs['difference'] = year_diffs['average_eps_act'] - year_diffs['average_eps_fc']
year_diffs['average_type'] = 'year'
plt.figure(figsize = [20, 15])
plt.xlabel('Firm ID', size = 15)
plt.ylabel('Prediction Error', size = 15)
plt.title('Prediction Error among Yearly Averages by Firm ID\n(Actual EPS - Forecasted EPS)', size = 25)
# plt.title('(Average Forecasted EPS - Average Actual EPS)')
# plt.rcParams = plt.rcParamsDefault
plt.plot(year_diffs['firm_id'], year_diffs['difference'])
# plt.scatter(twenty_diffs['firm_id'], twenty_diffs['difference'])
plt.savefig(PATH_BIVARIATE + 'avgs-diff-year.png')
plt.show();
#put all yearly, quarterly, and twenty-year differences into one DF
all_diffs = pd.concat([twenty_diffs, quarter_diffs, year_diffs], sort = False)
all_diffs.sample(5)
plt.figure(figsize = [20, 15])
plt.xlabel('Firm ID', size = 15)
plt.ylabel('Prediction Error', size = 15)
plt.title('Differences in Yearly Averages by Firm ID\n(Actual EPS - Forecasted EPS)', size = 25)
# plt.title('(Average Forecasted EPS - Average Actual EPS)')
# plt.rcParams = plt.rcParamsDefault
plt.plot(year_diffs['firm_id'], year_diffs['difference'])
# plt.scatter(twenty_diffs['firm_id'], twenty_diffs['difference'])
plt.savefig(PATH_BIVARIATE + 'avgs-diff-year.png')
plt.show();
plt.figure(figsize = [20, 15])
plt.xlabel('Firm ID', size = 15)
plt.ylabel('Prediction Error', size = 15)
plt.title('Prediction Error among All Averages by Firm ID\n(Twenty-Year, Yearly, and Quarterly)', size = 25)
plt.plot(all_diffs['firm_id'], all_diffs['difference'])
# plt.scatter(twenty_diffs['firm_id'], twenty_diffs['difference'])
plt.savefig(PATH_BIVARIATE + 'avgs-diff-all.png')
plt.show();
Observation: Since all the graphs depicting the average differences between forecasted and actual EPS follow the same pattern, it is safe to use the all_diffs DF to answer the yearly, quarterly, and twenty-year interpretations.
#helper function to convert a Series of firm_ids to their tickers
def convert_ids_to_tickers(series):
    series = series.apply(lambda x: firm_ids[firm_ids.firm_id == x].values[0][1]).values
    return [x.upper() for x in series]
#add column for absolute difference
all_diffs['difference_abs'] = all_diffs['difference'].abs()
#create column referring to firm ticker
all_diffs['firm_ticker'] = convert_ids_to_tickers(all_diffs.firm_id)
#rename column and values to fit legend
all_diffs.average_type = all_diffs.average_type.str.replace('_', ' ').str.capitalize()
all_diffs = all_diffs.rename(columns = {'average_type' : 'Average Type'})
Twenty-Year
#isolate twenty-year average type
twenty_diffs = all_diffs[all_diffs['Average Type'] == 'Twenty year']
#reorder firms by absolute difference
twenty_diffs_top = twenty_diffs.sort_values(by='difference_abs', ascending = False)
twenty_diffs_bottom = twenty_diffs.sort_values(by='difference_abs', ascending = True)
#drop rows with duplicate firm ids, get top 20 firm ids
twenty_diffs_top = twenty_diffs_top.drop_duplicates(subset = 'firm_id', keep = 'first').head(20)
twenty_diffs_bottom = twenty_diffs_bottom.drop_duplicates(subset = 'firm_id', keep = 'first').head(20)
#plot firm IDs as CATEGORICAL variables by their respective twenty-year differences
plt.figure(figsize = [20, 15])
sb.pointplot(x = twenty_diffs_top.firm_ticker, y = twenty_diffs_top.difference)
plt.suptitle('Top 20 Most Inaccurately Predicted EPS by Firm', size = 20, y = .93)
plt.title('as Twenty-Year Averages from 1999 - 2019', size = 17)
plt.xlabel('Firm Ticker', size = 17)
plt.ylabel('Prediction Error', size = 17)
plt.savefig(PATH_BIVARIATE + 'avgs-diff-twenty-year-top.png')
The 8 most inaccurately predicted firm tickers are AIG, LRCX, GM, BKNG, CHTR, AGN, GOOGL, and NVR.
#plot firm IDs as CATEGORICAL variables by their respective twenty-year differences
plt.figure(figsize = [20, 15])
sb.pointplot(x = twenty_diffs_bottom.firm_ticker, y = twenty_diffs_bottom.difference)
plt.suptitle('Top 20 Most Accurately Predicted EPS by Firm', size = 20, y = .93)
plt.title('as Twenty-Year Averages from 1999 - 2019', size = 17)
plt.xlabel('Firm Ticker', size = 17)
plt.ylabel('Prediction Error', size = 17)
plt.savefig(PATH_BIVARIATE + 'avgs-diff-twenty-year-bottom.png')
At first, the trend appears unstable, but that is a misleading read. Plotting the absolute value of each point shows that the trend among the top 20 most accurately predicted firms is a steady, linear pattern.
Quarterly
#isolate quarterly average type
quarter_diffs = all_diffs[all_diffs['Average Type'] == 'Quarter']
#create temp DF to get absolute mean difference of all firms
quarter_diffs_means = quarter_diffs.groupby('firm_ticker').mean().reset_index()
#get firm tickers of top/bottom 20 firms
quarter_diffs_top_ids = quarter_diffs_means.sort_values(by = 'difference_abs', ascending = False).head(20).firm_ticker.values
quarter_diffs_bottom_ids = quarter_diffs_means.sort_values(by = 'difference_abs', ascending = True).head(20).firm_ticker.values
#filter quarterly DF for top/bottom firm tickers
quarter_diffs_top = quarter_diffs[quarter_diffs.firm_ticker.isin(quarter_diffs_top_ids)]
quarter_diffs_bottom = quarter_diffs[quarter_diffs.firm_ticker.isin(quarter_diffs_bottom_ids)]
#plot firm IDs as CATEGORICAL variables by their respective quarterly differences
plt.figure(figsize = [20, 15])
sb.pointplot(x = quarter_diffs_top.firm_ticker, y = quarter_diffs_top.difference, order = quarter_diffs_top_ids)
plt.suptitle('Top 20 Most Inaccurately Predicted EPS by Firm', size = 20, y = .93)
plt.title('as Quarterly Averages from 1999 - 2019', size = 17)
plt.xlabel('Firm Ticker', size = 17)
plt.ylabel('Average Prediction Error', size = 17)
plt.savefig(PATH_BIVARIATE + 'avgs-diff-quarter-top.png')
The 9 most inaccurately predicted firms by quarterly average prediction error are AIG, LRCX, GM, BKNG, CHTR, AGN, GOOGL, NVR, and AZO.
#plot firm IDs as CATEGORICAL variables by their respective quarterly differences
plt.figure(figsize = [20, 15])
sb.pointplot(x = quarter_diffs_bottom.firm_ticker, y = quarter_diffs_bottom.difference, order = quarter_diffs_bottom_ids)
plt.suptitle('Top 20 Most Accurately Predicted EPS by Firm', size = 20, y = .93)
plt.title('as Quarterly Averages from 1999 - 2019', size = 17)
plt.xlabel('Firm Ticker', size = 17)
plt.ylabel('Average Prediction Error', size = 17)
plt.savefig(PATH_BIVARIATE + 'avgs-diff-quarter-bottom.png')
Similar to the previous twenty-year prediction error point plot, this graph oscillates between negative and positive values. In the long run, however, this does not matter, because the absolute distance between each y-value and 0 keeps decreasing.
Yearly
#isolate yearly average types
year_diffs = all_diffs[all_diffs['Average Type'] == 'Year']
#create temp DF to get absolute mean difference for all firms
year_diffs_means = year_diffs.groupby('firm_ticker').mean().reset_index()
#get firm tickers of top/bottom 20 firms
year_diffs_top_ids = year_diffs_means.sort_values(by = 'difference_abs', ascending = False).head(20).firm_ticker.values
year_diffs_bottom_ids = year_diffs_means.sort_values(by = 'difference_abs', ascending = True).head(20).firm_ticker.values
#filter yearly DF for top/bottom firm tickers
year_diffs_top = year_diffs[year_diffs.firm_ticker.isin(year_diffs_top_ids)]
year_diffs_bottom = year_diffs[year_diffs.firm_ticker.isin(year_diffs_bottom_ids)]
#plot firm IDs as CATEGORICAL variables by their respective yearly differences
plt.figure(figsize = [20, 15])
sb.pointplot(x = year_diffs_top.firm_ticker, y = year_diffs_top.difference, order = year_diffs_top_ids)
plt.suptitle('Top 20 Most Inaccurately Predicted EPS by Firm', size = 20, y = .93)
plt.title('as Yearly Averages from 1999 - 2019', size = 17)
plt.xlabel('Firm Ticker', size = 17)
plt.ylabel('Average Prediction Error', size = 17)
plt.savefig(PATH_BIVARIATE + 'avgs-diff-year-top.png')
The 6 most inaccurately predicted firms by yearly average prediction error are AIG, LRCX, GM, BKNG, CHTR, and NVR.
#plot firm IDs as CATEGORICAL variables by their respective yearly differences
plt.figure(figsize = [20, 15])
sb.pointplot(x = year_diffs_bottom.firm_ticker, y = year_diffs_bottom.difference, order = year_diffs_bottom_ids)
plt.suptitle('Top 20 Most Accurately Predicted Firms', size = 20, y = .93)
plt.title('as Yearly Averages from 1999 - 2019', size = 17)
plt.xlabel('Firm Ticker', size = 17)
plt.ylabel('Average Prediction Error', size = 17)
plt.savefig(PATH_BIVARIATE + 'avgs-diff-year-bottom.png')
Consistent with previous trends, this graph of the 20 most accurately predicted firms by yearly average prediction error mixes positive and negative values that inch closer to zero.
Twenty-Year, Quarterly, and Yearly
Main question: How do firms differ in forecast inaccuracy by average type (twenty-year, quarterly, and yearly)?
Steps:
1) Calculate the average yearly, quarterly, and twenty-year difference for each firm.
2) Convert the new average differences to their absolute value.
3) Isolate the top 20 and bottom 20 firms by their average absolute distances.
4) Convert Firm IDs to Firm Tickers
5) Use those firm tickers to create a visualization using 2 different datasets:
- avgs DF (all raw data)
- all_diffs DF (filtered data)
#calculate average difference for all average types
all_diffs_means = all_diffs.groupby('firm_ticker').mean().reset_index()
#get firm tickers of top/bottom 20 firms
all_diffs_top_ids = all_diffs_means.sort_values(by = 'difference_abs', ascending = False).head(20).firm_ticker.values
all_diffs_bottom_ids = all_diffs_means.sort_values(by = 'difference_abs', ascending = True).head(20).firm_ticker.values
#filter DF for top/bottom firm tickers
all_diffs_top = all_diffs[all_diffs.firm_ticker.isin(all_diffs_top_ids)]
all_diffs_bottom = all_diffs[all_diffs.firm_ticker.isin(all_diffs_bottom_ids)]
#plot firm IDs as CATEGORICAL variables by their respective twenty-year differences
plt.figure(figsize = [20, 15])
sb.pointplot(x = all_diffs_top.firm_ticker, y = all_diffs_top.difference, order = all_diffs_top_ids, errwidth = 0.5)
plt.suptitle('Top 20 Most Inaccurately Predicted EPS by Firm', size = 20, y = .93)
plt.title('across All Average Types', size = 17)
plt.xlabel('Firm Ticker', size = 17)
plt.ylabel('Average Prediction Error', size = 17)
plt.savefig(PATH_BIVARIATE + 'avgs-diff-all-top.png')
The top 8 most inaccurate firms by all average types are AIG, LRCX, GM, BKNG, CHTR, AGN, NVR, and GOOGL.
#plot firm IDs as CATEGORICAL variables by their respective twenty-year differences
plt.figure(figsize = [20, 15])
sb.pointplot(x = all_diffs_bottom.firm_ticker, y = all_diffs_bottom.difference, order = all_diffs_bottom_ids, errwidth = 0.5)
plt.suptitle('Top 20 Most Accurately Predicted EPS by Firm', size = 20, y = .93)
plt.title('across All Average Types', size = 17)
plt.xlabel('Firm Ticker', size = 17)
plt.ylabel('Average Prediction Error', size = 17)
plt.savefig(PATH_BIVARIATE + 'avgs-diff-all-bottom.png')
The same trend appears here: all values' absolute values steadily approach 0. Nothing new.
Now, let's look back at the top 20 most inaccurate firms, where I isolated the first several firms before the general trend started normalizing, approaching 0.
twenty-year average prediction errors: AIG, LRCX, GM, BKNG, CHTR, AGN, GOOGL, and NVR.
quarterly average prediction errors: AIG, LRCX, GM, BKNG, CHTR, AGN, GOOGL, NVR, and AZO.
yearly average prediction errors: AIG, LRCX, GM, BKNG, CHTR, and NVR.
all average prediction errors: AIG, LRCX, GM, BKNG, CHTR, AGN, NVR, and GOOGL.
From the previous summaries, some interesting observations pop up:
Observation 1: AIG is consistently the most inaccurately predicted firm across the twenty-year, quarterly, and yearly average types.
Observation 2: The firms in common—AIG, LRCX, GM, BKNG, CHTR, and NVR—are all ranked in the same order on the x-axis. Across all average types, these firms are consistent in their order.
Observation 3: AGN and GOOGL appear in all the above average types except yearly. This means AGN and GOOGL EPS were forecasted more accurately on a solely by-year basis.
Observation 4: AZO appears only in the top 20 most inaccurate quarterly average prediction error DataFrame.
Isolate Twenty-Year EPS data by firm
#helper function to separate eps_act and eps_fc averages into their own columns
def separate_eps_fc_act(df, groupby_arr, value_col):
    #combine eps_fc and eps_act averages into a single column, joined on firm id and time period
    df1 = df.groupby(groupby_arr)[value_col].apply(lambda x: ', '.join(x.astype(str))).reset_index()
    #separate eps_fc and eps_act averages then rename columns
    df1 = pd.concat([df1[groupby_arr], df1[value_col].str.split(', ', expand = True)], axis = 1)
    df1.rename(columns = {0: 'eps_fc', 1: 'eps_act'}, inplace = True)
    #convert data types
    df1 = df1.astype({'eps_fc' : 'float64', 'eps_act' : 'float64'})
    #add intercept column for OLS
    df1['intercept'] = 1
    return df1
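The string join/split trick above works because each group contains exactly one eps_fc and one eps_act row, in that order. A pivot_table achieves the same widening without the string round-trip or the ordering assumption; a minimal sketch on hypothetical values:

```python
import pandas as pd

#toy long-format averages (hypothetical values)
df = pd.DataFrame({'firm_id': [1, 1, 2, 2],
                   'feature': ['eps_fc', 'eps_act', 'eps_fc', 'eps_act'],
                   'average': [0.5, 0.6, 1.0, 0.9]})

#pivot the feature column into eps_fc / eps_act columns directly
wide = df.pivot_table(index = 'firm_id', columns = 'feature',
                      values = 'average').reset_index()
wide['intercept'] = 1
print(wide[['firm_id', 'eps_fc', 'eps_act', 'intercept']])
```

Because pivot_table keys on the feature value rather than on row order, it also stays correct if a group ever lists eps_act before eps_fc.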
avgs_twenty_ols = separate_eps_fc_act(avgs_twenty, ['firm_id', 'average_type'], 'average')
avgs_twenty_ols.sample(10)
ax = sb.lmplot(data= avgs_twenty_ols, x = 'eps_fc', y = 'eps_act', height = 10,
scatter_kws = {'color' : 'black'}, line_kws = {'color' : 'orchid'})
plt.title('Actual vs. Forecasted EPS', size = 25)
plt.suptitle('as Twenty-Year Averages', size = 17)
plt.xlabel('Forecasted EPS', size = 15)
plt.ylabel('Actual EPS', size = 15)
plt.savefig(PATH_BIVARIATE + 'avgs-act-fc-twenty.png')
Observation 1: The relationship between actual and forecasted EPS shows a wide variance in y-values from x = 0.0 to x = 9.0.
Observation 2: Most of the points are clustered between 0.0 and 2.5 on the x-axis. The remaining points are outliers, which contribute not only to the scatterplot's visible variance but also pull the trend toward a generally positive linear relationship (ignoring the outlier at x = 5.0, y = 0.25, of course).
Isolate Quarterly EPS data by firm
avgs_quarter_ols = separate_eps_fc_act(avgs_quarter, ['firm_id', 'average_type', 'time_period'], 'average')
avgs_quarter_ols.sample(10)
ax = sb.lmplot(data= avgs_quarter_ols, x = 'eps_fc', y = 'eps_act', height = 10,
scatter_kws = {'color' : 'black'}, line_kws = {'color' : 'orchid'})
plt.title('Actual vs. Forecasted EPS', size = 20)
plt.suptitle('as Quarterly Averages', size = 17)
plt.xlabel('Forecasted EPS', size = 15)
plt.ylabel('Actual EPS', size = 15)
plt.savefig(PATH_BIVARIATE + 'avgs-act-fc-quarter.png')
Observation 1: Unlike the previous graph showing the twenty-year averages of forecasted and actual EPS, the 95% confidence interval for quarterly averages is much narrower.
Observation 2: There is a similar pattern of all points being "clustered" at the beginning, with outliers being abundant outside of the initial x-range. Here, that range is from x = -1 to x = 3.5.
Observation 3: Outliers among EPS quarterly averages are much more varied and pronounced, which causes the regression line to be much less steep than the regression line for twenty-year EPS averages.
Isolate Yearly EPS data by firm
avgs_year_ols = separate_eps_fc_act(avgs_year, ['firm_id', 'average_type', 'time_period'], 'average')
avgs_year_ols.sample(10)
ax = sb.lmplot(data= avgs_year_ols, x = 'eps_fc', y = 'eps_act', height = 10,
scatter_kws = {'color' : 'black'}, line_kws = {'color' : 'orchid'})
plt.title('Actual vs. Forecasted EPS', size = 20)
plt.suptitle('as Yearly Averages', size = 17)
plt.xlabel('Forecasted EPS', size = 15)
plt.ylabel('Actual EPS', size = 15)
plt.savefig(PATH_BIVARIATE + 'avgs-act-fc-year.png')
Observation 1: There is significantly less variance among yearly EPS averages, with only 2 outliers at x = -4 and x = 0.5.
Observation 2: Compared to the twenty-year and quarterly EPS averages, the 95% confidence interval for yearly EPS averages is much narrower than either.
Observation 3: Residuals tend to stay closer to the regression line, implying that there is lower bias in the relationship between forecasted and actual yearly EPS averages.
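The intercept column added by separate_eps_fc_act sets these frames up for an ordinary least squares fit (e.g. via the sm.OLS import at the top of the notebook). As a sketch of what such a fit recovers, here is the equivalent least-squares computation on synthetic forecast/actual pairs with a known true slope of 0.9 (all values hypothetical, not the avgs data):

```python
import numpy as np

#synthetic forecast/actual pairs: actual = 0.9 * forecast + noise
rng = np.random.default_rng(0)
eps_fc = rng.normal(1.0, 0.5, 200)
eps_act = 0.9 * eps_fc + rng.normal(0.0, 0.1, 200)

#design matrix with an intercept column, mirroring the 'intercept'
#field that separate_eps_fc_act adds for statsmodels
X = np.column_stack([np.ones_like(eps_fc), eps_fc])
(intercept, slope), *_ = np.linalg.lstsq(X, eps_act, rcond = None)
print(round(intercept, 3), round(slope, 3))  #slope lands near the true 0.9
```

Fitting sm.OLS(df['eps_act'], df[['intercept', 'eps_fc']]) on the real avgs_year_ols frame would produce the same style of estimate, plus standard errors for judging the forecasts' bias.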
Isolate Raw EPS data by firm
features_eps_fc_act = features.loc[features['feature'].isin(['eps_act', 'eps_fc'])]
features_all_ols = separate_eps_fc_act(features_eps_fc_act, ['firm_id', 'term'], 'value')
features_all_ols.sample(10)
ax = sb.lmplot(data= features_all_ols, x = 'eps_fc', y = 'eps_act', height = 10,
scatter_kws = {'color' : 'black'}, line_kws = {'color' : 'orchid'})
plt.suptitle('Actual vs. Forecasted EPS', size = 20)
plt.xlabel('Forecasted EPS', size = 15)
plt.ylabel('Actual EPS', size = 15)
plt.savefig(PATH_BIVARIATE + 'features-act-fc-all.png')
Observation 1: The raw actual vs. forecasted EPS data helps paint a better overall picture of the trends.
Observation 2: The previous graphs depicting averages instead of raw data have clustered points at the beginning of the x-axis. But here, the scatterplot begins with 4 outliers before the first "cluster" of points appears.
Observation 3: From x = 0 onward, the initial cluster gradually disperses while the residuals keep a minimal distance from the regression line. Out of all the regression plots depicting actual vs. forecasted EPS, the raw data displays the least variance, which could also mean the lowest bias.
Observation 4: The slope for the raw EPS data is shallower than in all three previous graphs, depicting a slight positive linear relationship: the weakest linear trend of them all.
Observation 1: Interestingly enough, the graph encompassing the broadest data (all raw entries from features) has the most stable 95% confidence interval around its regression line.
Observation 2: The quarterly average regression line is the most unstable.
1) Create a DF depicting eps_act last quarter vs eps_act this quarter.
2) Create "dumb predictions"
3) Compare my predictions to all eps_fc values
#isolate eps_act entries, drop date column
features_dumb_eps = features[features['feature'] == 'eps_act'].loc[:, features.columns != 'date']
#grab previous quarter's eps_act per firm (grouping prevents values from leaking across firm boundaries)
features_dumb_eps['previous_value'] = features_dumb_eps.groupby('firm_id')['value'].shift(1)
features_dumb_eps = features_dumb_eps[['firm_id', 'feature','term', 'value', 'previous_value']]
features_dumb_eps.head()
"Dumb Prediction" = the moving average of the values from the last 2 quarters.
#create dumb prediction: 2-quarter rolling mean of actual EPS, computed per firm
features_dumb_eps['dumb_prediction'] = features_dumb_eps.groupby('firm_id')['value'].transform(lambda s: s.rolling(2).mean())
#preview
features_dumb_eps.head(5)
#create column showing corresponding eps_fc for that term
#create Series of eps_fc values
features_fc = features[features.feature == 'eps_fc'].drop(columns = ['date', 'feature'])
features_fc.head()
#merge eps_fc values with dumb predictions on 'firm_id' and 'term'
features_dumb_eps = features_dumb_eps.merge(features_fc, on = ['term', 'firm_id'], how = 'left')
features_dumb_eps.rename(columns = {'value_x' : 'value', 'value_y' : 'eps_fc_value'}, inplace = True)
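Before plotting, the two forecasts can also be compared numerically. This sketch computes RMSE, the metric behind the mean_squared_error import at the top of the notebook, on toy actual/forecast triples (hypothetical values, not the merged frame above):

```python
import numpy as np

#toy actual values plus two competing forecasts (hypothetical values)
value = np.array([1.0, 1.2, 0.8, 1.1])   #actual EPS
eps_fc = np.array([0.9, 1.1, 0.9, 1.0])  #experts' forecasts
dumb = np.array([1.0, 1.1, 1.0, 1.0])    #2-quarter rolling-mean stand-in

def rmse(pred):
    #root-mean-square error against the actual values; lower is better
    return np.sqrt(np.mean((value - pred) ** 2))

print(rmse(eps_fc), rmse(dumb))
```

Running the same comparison on the real value, eps_fc_value, and dumb_prediction columns (after dropping NaN rows) would put a single number on how much the experts outperform the dumb predictions.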
#create 3-year moving averages for all numerical fields
features_dumb_eps['ma_value'] = features_dumb_eps.value.rolling(12).mean()
features_dumb_eps['ma_previous_value'] = features_dumb_eps.previous_value.rolling(12).mean()
features_dumb_eps['ma_dumb_prediction'] = features_dumb_eps.dumb_prediction.rolling(12).mean()
features_dumb_eps['ma_eps_fc_value'] = features_dumb_eps.eps_fc_value.rolling(12).mean()
features_dumb_eps.sample(5)
features_dumb_eps.head()
fig = plt.figure(figsize = [10, 10])
sb.scatterplot(data = features_dumb_eps, x = 'dumb_prediction', y = 'value', color = 'green')
sb.scatterplot(data = features_dumb_eps, x = 'eps_fc_value', y = 'value', color = 'purple')
fig.legend(labels = ['My Forecasted EPS', 'Forecasted EPS'], prop = {'size' : 15}, loc = 'lower right')
plt.suptitle('Personal EPS Forecasts vs. Actual EPS', size = 20, y = .94)
# plt.title('in 3-year Moving Averages', size = 17)
plt.ylabel('Actual EPS')
plt.xlabel('Forecasted EPS')
plt.savefig(PATH_MULTIVARIATE + 'features-dumb-act-fc-scatter.png')
plt.show();
Observation 1: With forecasted EPS on the x-axis and actual EPS on the y-axis, my personal forecasts show much more variance, with outliers spread across all 4 quadrants of the scatterplot. Ignoring the outliers, my personal forecasts all cluster around (0, 0).
Observation 2: Experts' forecasted EPS, by contrast, has only 3 outliers and otherwise stays uniformly clustered around (0, 0). This means my approach of generating personal forecasts from 2-quarter averages is much less accurate than the method the Bloomberg forecasters used.
fig = plt.figure(figsize = [10, 10])
sb.regplot(data = features_dumb_eps, x = 'dumb_prediction', y = 'value', color = 'green')
sb.regplot(data = features_dumb_eps, x = 'eps_fc_value', y = 'value', color = 'purple')
fig.legend(labels = ['My Forecasted EPS', 'Forecasted EPS'], prop = {'size' : 15}, loc = 'lower right')
plt.suptitle('Forecasted EPS vs. Actual EPS', size = 20, y = .94)
# plt.title('in 3-year Moving Averages', size = 17)
plt.ylabel('Actual EPS')
plt.xlabel('Forecasted EPS')
plt.savefig(PATH_MULTIVARIATE + 'features-dumb-act-fc-reg.png')
plt.show();
Observation 1: Not surprisingly, my personal forecasted EPS has a much wider 95% confidence interval than experts' forecasted EPS. Due to the wider variance for the former, those 2-quarter rolling averages would subsequently come equipped with a broader confidence interval scope.
Observation 2: Even counting outliers, experts' forecasted EPS exhibits a greater rate of increase than my personal forecasts after x = 0. This means that, measured against actual EPS values, experts' forecasts tend to surpass my personal forecasts.
Observation 3: Ignoring outliers, experts' forecasted EPS and my personal forecasts both cluster uniformly around (0, 0), with slightly more of my personal forecasts occupying the left side of the cluster, before x = 0. This confirms my earlier point: although both variables show a generally consistent positive linear trend, my personal forecasts are less reliable and more inaccurate because they have more variance, outliers inconsistent with their regression line's trend, and moving averages that depend solely on historical data.
Observation 4: Most of my personally generated outliers reside in Quadrant I and along the x-axis side of Quadrant IV, while all of the experts' forecasted EPS outliers reside in Quadrants II and III. The two sets of outliers occupy completely different quadrants, which further underscores that although both regression lines share a positive linear trend, the two forecasts are not necessarily correlated with each other in terms of actual EPS.
fig = plt.figure(figsize = [10, 10])
sb.scatterplot(data = features_dumb_eps, x = 'dumb_prediction', y = 'value', color = 'purple')
plt.suptitle('Actual EPS vs. My Predicted EPS', size = 20, y = .94)
plt.ylabel('Actual EPS')
plt.xlabel('My Predicted EPS')
plt.savefig(PATH_MULTIVARIATE + 'features-dumb-act-scatter.png')
plt.show();
fig = plt.figure(figsize = [10, 10])
sb.regplot(data = features_dumb_eps, x = 'dumb_prediction', y = 'value', color = 'purple')
plt.suptitle('Actual EPS vs. My Predicted EPS', size = 20, y = .94)
plt.ylabel('Actual EPS')
plt.xlabel('My Predicted EPS')
plt.savefig(PATH_MULTIVARIATE + 'features-dumb-act-reg.png')
plt.show();
value_vars = ['value', 'previous_value', 'dumb_prediction', 'eps_fc_value', 'ma_value', 'ma_previous_value',
'ma_dumb_prediction', 'ma_eps_fc_value']
#melt dumb_predictions and eps_fc_value to prepare for plotting
df_dumb_eps_melt = pd.melt(features_dumb_eps, id_vars = ['firm_id', 'feature', 'term'],
value_vars = value_vars,
var_name = 'value_type')
#isolate only dumb predictions and eps_fc_value
dumb_eps_fc = df_dumb_eps_melt[df_dumb_eps_melt.value_type.isin(['dumb_prediction', 'eps_fc_value', 'value'])].copy()
dumb_eps_fc['ma_value'] = dumb_eps_fc.value.rolling(12).mean()
dumb_eps_fc.sample(5)
plt.figure(figsize = [20, 10])
ax = sb.pointplot(x = 'term', y = 'value', data = dumb_eps_fc, hue = 'value_type',
errwidth = .5, palette = 'Set2')
plt.xticks(rotation = 'vertical')
leg_handles = ax.get_legend_handles_labels()[0]
ax.legend(leg_handles, ['Actual EPS', 'My Forecasted EPS', 'Forecasted EPS'], prop = {'size' : 20})
plt.title('Average Forecasted EPS and My Predicted EPS by Term', size = 20)
plt.ylabel('Average EPS Value')
plt.xlabel('Term')
plt.savefig(PATH_MULTIVARIATE + 'features-dumb-eps.png')
plt.show();
#isolate moving averages of actual EPS, my predicted EPS, and Bloomberg forecasted EPS
ma_dumb_fc = df_dumb_eps_melt[df_dumb_eps_melt.value_type.isin(['ma_dumb_prediction', 'ma_value', 'ma_eps_fc_value'])]
ma_dumb_fc.value_type.unique()
plt.figure(figsize = [20, 10])
ax = sb.pointplot(x = 'term', y = 'value', data = ma_dumb_fc, hue = 'value_type',
errwidth = .5, palette = "Set2")
plt.xticks(rotation = 'vertical')
leg_handles = ax.get_legend_handles_labels()[0]
ax.legend(leg_handles, ['Actual EPS', 'My Forecasted EPS', 'Forecasted EPS'], prop = {'size' : 20})
plt.suptitle('Average Forecasted EPS and My Predicted EPS by Term', size = 20, y = .95)
plt.title('in 3-Year Moving Averages from 1999 - 2019', size = 17)
plt.ylabel('Moving Average EPS Value')
plt.xlabel('Term')
plt.savefig(PATH_MULTIVARIATE + 'features-dumb-eps-ma.png')
plt.show();
Observation 1: In the first figure, my predicted EPS strays drastically from the actual EPS trend in two instances. Between 2000Q4 and 2001Q1, my forecasts turn optimistic, forming a 2-point peak before rejoining the uniform trend. Between 2008Q4 and 2009Q1, my forecasts turn pessimistic, forming a 2-point trough before climbing back up to the uniform trend of the actual EPS.
Observation 2: When plotting moving averages per term, both trend lines follow an almost-identical polynomial trend, trending toward a positive slope overall.
The technique that the Bloomberg forecasters used was much more effective than my method of predicting with 2-quarter moving averages.
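As a minimal sketch (toy numbers, not the notebook's data), the naive 2-quarter moving-average forecast can be computed like this; the `shift(1)` ensures only past quarters feed each prediction:

```python
import pandas as pd

# hypothetical quarterly actual EPS series, for illustration only
eps = pd.Series([1.0, 2.0, 3.0, 4.0])

# forecast for quarter t = mean of the two preceding actual values
naive_fc = eps.rolling(2).mean().shift(1)
# first two quarters have no forecast (not enough history): NaN, NaN, 1.5, 2.5
```

The first two entries are NaN because two prior quarters of history are required before a prediction exists.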
#add percentage_error column to features
features_pct = features[features.feature.isin(['eps_fc', 'eps_act'])]
features_pct = separate_eps_fc_act(features_pct, ['firm_id', 'term'], 'value')
features_pct['pct_error'] = (features_pct.eps_act - features_pct.eps_fc) / features_pct.eps_act
features_pct['pct_error'] = features_pct['pct_error'].replace([np.inf, -np.inf], np.nan)
#get absolute value of percentage errors
features_pct['pct_error_abs'] = features_pct.pct_error.abs()
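A small hedged example of the percentage-error computation above, using made-up actual/forecast pairs; the zero actual shows why the inf-to-NaN replacement is needed:

```python
import numpy as np
import pandas as pd

# hypothetical actual/forecast pairs (illustration only)
df = pd.DataFrame({'eps_act': [2.0, 0.0, -1.0],
                   'eps_fc':  [1.5, 1.0, -2.0]})

df['pct_error'] = (df.eps_act - df.eps_fc) / df.eps_act
# dividing by a zero actual yields +/-inf, which we treat as missing
df['pct_error'] = df['pct_error'].replace([np.inf, -np.inf], np.nan)
df['pct_error_abs'] = df['pct_error'].abs()
# pct_error: 0.25, NaN (was -inf), -1.0
```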
#add ticker column for plotting assistance
features_pct['firm_ticker'] = convert_ids_to_tickers(features_pct.firm_id)
#add 3-year (12-quarter) moving averages for percentage error, computed per firm so windows do not span firms
features_pct['ma_pct_error'] = features_pct.groupby('firm_id').pct_error.transform(lambda s: s.rolling(12).mean())
#grab the top/bottom 5 firms by absolute percentage error
features_pct_top_ids = features_pct.sort_values(by = 'pct_error_abs', ascending = False).firm_id.drop_duplicates().head(5).values
features_pct_bottom_ids = features_pct.sort_values(by= 'pct_error_abs', ascending = True).firm_id.drop_duplicates().head(5).values
#filter the DataFrame for those firms
features_pct_top = features_pct[features_pct.firm_id.isin(features_pct_top_ids)]
features_pct_bottom = features_pct[features_pct.firm_id.isin(features_pct_bottom_ids)]
Percentage Error by Year
plt.figure(figsize = [20, 15])
ax = sb.pointplot(data = features_pct_top, x = features_pct_top.term.dt.year, y = 'pct_error',
hue = 'firm_ticker', dodge = True, errwidth = 0.5)
plt.xlabel('Year')
plt.ylabel('Average Percentage Error')
plt.title('Average Percentage Error by Year (for the Top 5 Most Inaccurate Firms)', size = 20)
plt.savefig(PATH_MULTIVARIATE + 'pct-firm-year-top.png')
plt.show()
The above visual isn't great for analysis. Let's look at a strip plot distribution instead.
plt.figure(figsize = [20, 10])
sb.stripplot(data = features_pct_top, x = features_pct_top.term.dt.year, y = 'pct_error', hue = 'firm_ticker')
plt.xticks(rotation = 'vertical')
plt.suptitle('Distribution of EPS Percentage Error by Year', size = 20, y = .94)
plt.title('for the Top 5 Most Inaccurate Firms', size = 17)
plt.xlabel('Year')
plt.ylabel('Percentage Error')
plt.savefig(PATH_MULTIVARIATE + 'pct-firm-year-top-strip.png')
Observation 1: The most notable outlier is IRM, with a percentage error of over -1000 for EPS in the year 2014.
Observation 2: IBM, IRM, MCK, PXD, and QRVO are the five most inaccurately predicted firms in terms of percentage error by year.
Observation 3: From 2014 - 2018, all firms besides IRM sit at a large distance from IRM's 2014 outlier.
Observation 4: Outliers only started showing up in the distributions in the year 2014, meaning EPS forecasts have become more inaccurate in recent years.
Percentage Error by Quarter
plt.figure(figsize = [10, 10])
ax = sb.stripplot(data = features_pct_top, x = features_pct_top.term.dt.quarter, y = 'pct_error',
hue = 'firm_ticker', jitter = True)
plt.suptitle('Distribution of EPS Percentage Error by Quarter', size = 15, y = .94)
plt.title('for the Top 5 Most Inaccurate Firms', size = 13)
plt.ylabel('Average Percentage Error', size = 12)
plt.xlabel('Quarter', size = 12)
plt.legend(title = 'Firm Ticker')
plt.savefig(PATH_MULTIVARIATE + 'pct-firm-quarter-top-strip.png')
Observation 1: The most notable outlier is IRM yet again, with a percentage error of over -1000 for EPS in Quarter 3.
Observation 2: IBM, IRM, MCK, PXD, and QRVO are the five most inaccurately predicted firms in terms of percentage error by quarter. This list is the same as the previous stripplot and pointplot depicting years.
Observation 3: Quarter 4 is the only quarter with no outliers. This makes intuitive sense: as the quarters of a year pass and familiarity with the current fiscal year builds, forecasted EPS should become more accurate. Let us test this theory.
plt.figure(figsize = [20, 15])
ax = sb.pointplot(data = features_pct_top, x = features_pct_top.term.dt.quarter, y = 'pct_error',
hue = 'firm_ticker', dodge = True, errwidth = 0.5)
plt.xlabel('Quarter', size = 17)
plt.ylabel('Average Percentage Error', size = 17)
plt.title('Average Percentage Error by Quarter (for the Top 5 Most Inaccurate Firms)', size = 20)
plt.xticks(size = 15)
plt.legend(title = 'Firm Ticker')
plt.savefig(PATH_MULTIVARIATE + 'pct-firm-quarter.png')
plt.show()
Observation 1: The firms IRM and IBM deviate greatly in Quarter 3, where forecasters' predictions veer away from zero. Meanwhile, the other firms cluster more closely around 0. The most inaccurate predictions, on average, happen to be made during Quarter 3 in any given year.
Observation 2: The previous observation disproves my initial intuition that forecasters' predictions would become more accurate by the quarter.
Percentage Error by Year and Quarter
plt.figure(figsize = [22, 15])
ax = sb.pointplot(data = features_pct_top, x = 'term', y = 'pct_error', hue = 'firm_ticker', dodge = True)
plt.xlabel('Term')
plt.ylabel('Percentage Error')
plt.title('Percentage Error by Year and Quarter (for the Top 5 Most Inaccurate Firms)', size = 20)
plt.xticks(rotation = 'vertical')
plt.legend(title = 'Firm Ticker')
plt.savefig(PATH_MULTIVARIATE + 'pct-firm-term.png')
plt.show()
This isn't a very good visualization either. Let's look at a stripplot instead:
plt.figure(figsize = [20, 10])
sb.stripplot(data = features_pct_top, x = 'term', y = 'pct_error', hue = 'firm_ticker')
plt.suptitle('Distribution of EPS Percentage Error by Term', size = 20, y = .94)
plt.title('for the Top 5 Most Inaccurate Firms', size = 17)
plt.xlabel('Term', size = 17)
plt.ylabel('Percentage Error', size = 17)
plt.xticks(rotation = 'vertical')
plt.legend(title = 'Firm Ticker')
plt.savefig(PATH_MULTIVARIATE + 'pct-firm-term-strip.png')
plt.show();
Observation 1: The most notable outlier is IRM, with a percentage error of over -1000 for EPS in the term 2014Q3.
Observation 2: IBM, IRM, MCK, PXD, and QRVO are the five most inaccurately predicted firms in terms of percentage error by term. This list is consistent with the previous plots featuring standalone quarters and years.
Observation 3: Outliers start after 2014Q3. In subsequent terms, outliers tend to appear exclusively in Q1 and Q3, and far more often in Q3.
Observation 4: From 2014Q3 - 2018Q3, all firms besides IRM sit at a large distance from IRM's 2014Q3 outlier.
Observation 5: Outliers only started showing up in the distributions in the year 2014, meaning EPS forecasts have become more inaccurate in recent terms, starting from 2014Q3. This conclusion is consistent with the observations made in the quarterly and yearly stripplots & pointplots.
features_pct['difference'] = features_pct.eps_act - features_pct.eps_fc
features_pct['difference_abs'] = features_pct.difference.abs()
diffs_top_ids = features_pct.sort_values(by = 'difference_abs', ascending = False).firm_id.drop_duplicates().head(5).values
diffs_bottom_ids = features_pct.sort_values(by = 'difference_abs', ascending = True).firm_id.drop_duplicates().head(5).values
diffs_top = features_pct[features_pct.firm_id.isin(diffs_top_ids)]
diffs_bottom = features_pct[features_pct.firm_id.isin(diffs_bottom_ids)]
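The sort → drop_duplicates → head pattern used above to pick the most inaccurate firms can be illustrated on toy data (hypothetical firm IDs and errors only):

```python
import pandas as pd

# hypothetical per-term absolute errors for three firms
df = pd.DataFrame({'firm_id': [1, 1, 2, 3],
                   'difference_abs': [5.0, 0.1, 3.0, 1.0]})

# rank rows by error descending, keep each firm's first (largest) row,
# then take the 2 firms with the biggest single-term errors
top_ids = (df.sort_values(by = 'difference_abs', ascending = False)
             .firm_id.drop_duplicates().head(2).values)
# top_ids -> firms 1 and 2
```

Note this ranks firms by their single worst observation, not by their average error; that is the behavior of the notebook's selection code as written.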
Prediction Error by Term
plt.figure(figsize = [20, 10])
sb.pointplot(data = diffs_top, x = diffs_top.term, y = 'difference', hue = 'firm_ticker', errwidth = 0.5)
plt.legend(title = 'Firm Ticker')
plt.xticks(rotation = 'vertical')
plt.xlabel('Term')
plt.ylabel('Average Prediction Error')
plt.suptitle('Average EPS Prediction Error by Term', size = 20, y = .94)
plt.title('for the Top 5 Most Inaccurate Firms', size = 17)
plt.savefig(PATH_MULTIVARIATE + 'features-perror-term-firm.png')
plt.show();
Observation 1: The 2 most noticeable outliers are the firm AIG in 2008Q4 (optimistic forecasts), and LRCX in 2000Q3 (pessimistic forecasts).
Observation 2: The 5 most inaccurate firms by prediction error are AGN, AIG, CHTR, LRCX, and VRSN.
Observation 3: VRSN is the most stable of the top 5 most inaccurate firms; its average prediction error by term hugs closely to a slopeless y = 0 trend.
Prediction Error by Year
plt.figure(figsize = [20, 10])
sb.pointplot(data = diffs_top, x = diffs_top.term.dt.year, y = 'difference', hue = 'firm_ticker',errwidth = 0.5)
plt.legend(title = 'Firm Ticker')
plt.xticks(rotation = 'vertical')
plt.xlabel('Year')
plt.ylabel('Prediction Error')
plt.suptitle('Average EPS Prediction Error by Year', size = 20, y = .94)
plt.title('for the Top 5 Most Inaccurate Firms', size = 17)
plt.savefig(PATH_MULTIVARIATE + 'features-perror-year-firm.png')
plt.show();
Observation 1: Yet again, the 2 most noticeable outliers are the firm AIG in 2008 (optimistic forecasts), and LRCX in 2000 (pessimistic forecasts).
Observation 2: The 5 most inaccurate firms by prediction error are AGN, AIG, CHTR, LRCX, and VRSN. The same list as before.
Observation 3: VRSN is the most stable of the top 5 most inaccurate firms; it contains no outliers. Data for the firm CHTR shows up only in 2017.
Prediction Error by Quarter
plt.figure(figsize = [20, 10])
sb.pointplot(data = diffs_top, x = diffs_top.term.dt.quarter, y = 'difference', hue = 'firm_ticker',errwidth = 0.5)
plt.legend(title = 'Firm Ticker')
plt.xlabel('Quarter')
plt.ylabel('Prediction Error')
plt.suptitle('Average EPS Prediction Error by Quarter', size = 20, y = .94)
plt.title('for the Top 5 Most Inaccurate Firms', size = 17)
plt.xticks(size = 15)
plt.savefig(PATH_MULTIVARIATE + 'features-perror-quarter-firm.png')
plt.show();
Observation 1: The most noticeable outlier is AIG, which is, on average, the only outlier in Q3 of any given year.
Observation 2: The 5 most inaccurate firms by prediction error are AGN, AIG, CHTR, LRCX, and VRSN. The same list as generated through a yearly and term basis.
Observation 3: These 5 firms display relatively low prediction errors until Q4, where both LRCX and AIG "branch off" into opposite directions. This means that forecasters, on average, are likely to be more inaccurate in their EPS forecasts in Q4 of any given year.
Observation 4: VRSN is the most stable of the top 5 most inaccurate firms; it contains no outliers. Data for the firm CHTR shows up only in 2017.
Find root mean square error for both "dumb" forecasts and Bloomberg forecasts
#isolate both forecast EPS types and actual EPS
eps_fc_all = features_dumb_eps[['firm_id', 'feature', 'term', 'value', 'dumb_prediction', 'eps_fc_value']]
eps_fc_all = eps_fc_all.rename(columns = {'value' : 'eps_act', 'dumb_prediction' : 'eps_fc_db', 'eps_fc_value' : 'eps_fc'})
eps_fc_all.sample(5)
#helper function
def calculate_rmse(df, predicted_col):
    #RMSE between actual EPS and the given prediction column
    rmse = np.sqrt(mean_squared_error(df['eps_act'], df[predicted_col]))
    return pd.Series(rmse)
temp = eps_fc_all.dropna()
#number of entries dropped
eps_fc_all.shape[0] - temp.shape[0]
We cannot include entries with NaN values in our RMSE calculations, so we dropped 8337 entries with missing actual EPS, Bloomberg forecasted EPS, or personal EPS forecasts.
Original size: 42420
New size: 34083
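A sketch of the per-firm grouped RMSE computation that follows, on hypothetical data (the `rmse` helper here mirrors `calculate_rmse` but returns a plain float):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error

# hypothetical per-firm actual vs. predicted EPS (illustration only)
toy = pd.DataFrame({'firm_id': [1, 1, 2, 2],
                    'eps_act': [1.0, 2.0, 3.0, 4.0],
                    'eps_fc':  [1.0, 2.0, 3.0, 2.0]})

def rmse(group):
    #root mean squared error between actual and predicted EPS
    return np.sqrt(mean_squared_error(group['eps_act'], group['eps_fc']))

per_firm = toy.groupby('firm_id')[['eps_act', 'eps_fc']].apply(rmse)
# firm 1 has perfect forecasts (RMSE 0); firm 2 has errors (0, 2), so RMSE = sqrt(2)
```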
#isolate Bloomberg RMSE
firm_rmse_fc = temp.groupby(['firm_id']).apply(calculate_rmse, predicted_col = 'eps_fc').reset_index().rename(columns = { 0: 'rmse_fc'})
firm_rmse_fc.rmse_fc = firm_rmse_fc.rmse_fc.astype('float64')
#isolate dumb RMSE
firm_rmse_fc_db = temp.groupby('firm_id').apply(calculate_rmse, predicted_col = 'eps_fc_db').reset_index().rename(columns = {0: 'rmse_fc_db'})
firm_rmse_fc_db.rmse_fc_db = firm_rmse_fc_db.rmse_fc_db.astype('float64')
#create 1 DataFrame to contain both RMSE types
cross_val_firm = pd.merge(firm_rmse_fc, firm_rmse_fc_db, how = 'outer', on = 'firm_id')
cross_val_firm.head()
#add ticker column for plotting assistance
cross_val_firm['firm_ticker'] = convert_ids_to_tickers(cross_val_firm.firm_id)
cross_val_firm['diff'] = cross_val_firm.rmse_fc_db - cross_val_firm.rmse_fc
cross_val_firm_melt = pd.melt(cross_val_firm, id_vars = ['firm_id', 'firm_ticker', 'diff'],
value_vars = ['rmse_fc', 'rmse_fc_db'],
var_name = 'rmse_type',
value_name = 'rmse')
(cross_val_firm['diff'] > 0).sum() / cross_val_firm.shape[0]
cross_val_firm_melt.sample(5)
Reorder by the top 20 most inaccurate firms (naive AND Bloomberg RMSE)
firm_rmse_fc_top = cross_val_firm_melt[cross_val_firm_melt['rmse_type'] == 'rmse_fc'].sort_values(by = 'rmse', ascending = False)
firm_rmse_fc_db_top = cross_val_firm_melt[cross_val_firm_melt['rmse_type'] == 'rmse_fc_db'].sort_values(by = 'rmse', ascending = False)
firm_rmse_fc_top_tickers = firm_rmse_fc_top.head(20).firm_ticker.drop_duplicates().values
firm_rmse_fc_db_top_tickers = firm_rmse_fc_db_top.head(20).firm_ticker.drop_duplicates().values
set(firm_rmse_fc_top_tickers) ^ set(firm_rmse_fc_db_top_tickers)
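The `^` operator above is a set symmetric difference: it keeps tickers that appear in exactly one of the two top-20 lists. A toy illustration with hypothetical ticker sets:

```python
# hypothetical top-ticker sets, for illustration only
blmbrg_top = {'APA', 'AVGO', 'ETFC', 'RE', 'AIG'}
naive_top = {'APA', 'AVGO', 'ETFC', 'RE', 'LRCX'}

# symmetric difference: tickers in one list but not both
only_in_one = sorted(blmbrg_top ^ naive_top)
# -> ['AIG', 'LRCX']
```

An empty result would mean the two top-20 lists contain exactly the same firms.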
#top 20 most inaccurate firms by Bloomberg EPS forecast RMSE
top_firm_rmse_fc = cross_val_firm_melt[cross_val_firm_melt.firm_ticker.isin(firm_rmse_fc_top_tickers)]
top_firm_rmse_fc = top_firm_rmse_fc[top_firm_rmse_fc['rmse_type'] == 'rmse_fc'].sort_values(by = 'rmse', ascending = False)
plt.figure(figsize = [15, 10])
sb.pointplot(data = top_firm_rmse_fc, x = 'firm_ticker', y = 'rmse')
plt.xticks(rotation = '90')
#top 20 most inaccurate firms by naive EPS forecast RMSE
top_firm_rmse_fc_db = cross_val_firm_melt[cross_val_firm_melt.firm_ticker.isin(firm_rmse_fc_db_top_tickers)]
top_firm_rmse_fc_db = top_firm_rmse_fc_db[top_firm_rmse_fc_db['rmse_type'] == 'rmse_fc_db'].sort_values(by = 'rmse', ascending = False)
#plot the top 20 most inaccurate firms by naive EPS forecast RMSE
plt.figure(figsize = [15, 10])
sb.pointplot(data = top_firm_rmse_fc_db, x = 'firm_ticker', y = 'rmse')
plt.xticks(rotation = '90')
Observation 1: The 4 firms that appear in both top-20 most-inaccurate RMSE lists are APA, AVGO, ETFC, and RE.
Observation 2: The 2 most common outliers are LRCX and AIG, which consistently have the 2 largest RMSE values whether measured by Bloomberg or by naive forecasts.
random_tickers = cross_val_firm_melt.firm_ticker.drop_duplicates().sample(20).values
rand_cross_val_firm = cross_val_firm_melt[cross_val_firm_melt.firm_ticker.isin(random_tickers)]
plt.figure(figsize = [15, 10])
sb.pointplot(data = rand_cross_val_firm, x = 'firm_ticker', y = 'rmse', hue = 'rmse_type', scale = .65)
plt.xticks(rotation = '90')
((cross_val_firm['diff'] > 0).sum()) / cross_val_firm.shape[0]
Observation 1: After manually calculating the proportion of positive RMSE differences, my naive forecasts did better than the Bloomberg EPS forecasts for 13.05% of firms.
cross_val_firm_lim_melt = cross_val_firm_melt[cross_val_firm_melt.firm_ticker.isin(random_tickers)]
cross_val_firm_lim = cross_val_firm[cross_val_firm.firm_ticker.isin(random_tickers)]
import plotly.express as px
fig = px.scatter(cross_val_firm_lim_melt, x = 'rmse', y = 'firm_ticker', color = 'rmse_type')
fig.show()
from plotly import graph_objects as go
fig = go.Figure()
fig.add_trace(go.Scatter(
x = cross_val_firm_lim.rmse_fc,
y = cross_val_firm_lim.firm_ticker,
marker = dict(color = "crimson", size = 12),
mode = 'markers'))
fig.show()
naive_blmbrg = features_dumb_eps[['firm_id', 'term', 'value', 'eps_fc_value', 'dumb_prediction']]
naive_blmbrg = naive_blmbrg.rename(columns = {'value' : 'eps_act',
'eps_fc_value' : 'eps_fc',
'dumb_prediction' : 'naive_fc'})
naive_blmbrg.head()
#add raw prediction errors
naive_blmbrg['raw_pred_err'] = naive_blmbrg['eps_act'] - naive_blmbrg['eps_fc']
naive_blmbrg['naive_pred_err'] = naive_blmbrg['eps_act'] - naive_blmbrg['naive_fc']
naive_blmbrg.head()
#melt for easier plotting
naive_blmbrg = pd.melt(naive_blmbrg, id_vars = ['firm_id', 'term', 'eps_act', 'eps_fc', 'naive_fc'], value_vars = ['raw_pred_err', 'naive_pred_err'],
var_name = 'err_type', value_name = 'err_value')
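A minimal sketch of what this melt does, on a hypothetical two-firm frame: each error column becomes rows tagged by `err_type`, which is what lets seaborn color the two error series with `hue`:

```python
import pandas as pd

# hypothetical wide frame: one row per (firm, term), one column per error type
wide = pd.DataFrame({'firm_id': [1, 2],
                     'term': ['2008Q4', '2008Q4'],
                     'raw_pred_err': [0.1, -0.3],
                     'naive_pred_err': [0.2, 0.0]})

# melt to long form: one row per (firm, term, error type)
long = pd.melt(wide, id_vars = ['firm_id', 'term'],
               value_vars = ['raw_pred_err', 'naive_pred_err'],
               var_name = 'err_type', value_name = 'err_value')
# 2 firms x 2 error types -> 4 rows
```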
plt.figure(figsize = [20, 15])
sb.pointplot(data = naive_blmbrg, x = 'term', y = 'err_value', hue = 'err_type', errwidth = 0.5, scale = .65)
plt.xticks(rotation = '90')
plt.figure(figsize = [20, 15])
sb.pointplot(data = naive_blmbrg, x = naive_blmbrg.term.dt.year, y = 'err_value', hue = 'err_type', errwidth = 0.5, scale = .65)
plt.xticks(rotation = '90')
plt.figure(figsize = [20, 15])
sb.pointplot(data = naive_blmbrg, x = naive_blmbrg.term.dt.quarter, y = 'err_value', hue = 'err_type', errwidth = 0.5, scale = .65)
plt.xticks(rotation = '90')
random_ids = naive_blmbrg.firm_id.drop_duplicates().sample(n = 5, random_state = 4).values
naive_blmbrg_lim = naive_blmbrg[naive_blmbrg.firm_id.isin(random_ids)]
naive_blmbrg_lim_pred = naive_blmbrg_lim[naive_blmbrg_lim['err_type'] == 'raw_pred_err'].copy()
#add ticker column for plotting assistance
naive_blmbrg_lim_pred['firm_ticker'] = convert_ids_to_tickers(naive_blmbrg_lim_pred.firm_id)
plt.figure(figsize = [20, 15])
sb.pointplot(data = naive_blmbrg_lim_pred, x = 'term', y = 'err_value', hue = 'firm_ticker', scale = .5, errwidth = 0.5)
plt.xticks(rotation = '90')
naive_blmbrg_lim_naive = naive_blmbrg_lim[naive_blmbrg_lim['err_type'] == 'naive_pred_err']
plt.figure(figsize = [20, 15])
sb.pointplot(data = naive_blmbrg_lim_naive, x = 'term', y = 'err_value', hue = 'firm_id', scale = .5, errwidth = 0.5)
plt.xticks(rotation = '90')
Question 1: Does average EPS prediction error depict any differences in trends whether by a yearly, quarterly, or full-term basis?
According to figure (features-act-fc-diffs-term), forecasters were most optimistic in 2008, Quarter 4, and most pessimistic in 2000, Quarter 4.
According to figure (features-act-fc-diffs-year), forecasters were most pessimistic in the year 2000, and most optimistic in the year 2008.
The year 2000 shows one of the widest variances in average prediction errors (ranging from 0 to 0.5).
The year 2008 also shows one of the widest variances in average prediction errors (ranging from -0.25 to -1.5).
The overall trend between the first two figures depicting average EPS prediction error by quarter and year is consistent. When ignoring the pessimistic and optimistic outliers, there is no slope; all average EPS forecasts gather around 0 by the year.
It is only by a quarterly basis that we see a predictable pattern emerge, where outliers (usually optimistic) happen exclusively during Q4.
When comparing the quarterly trends to the full-term trends, indeed, the 2 outliers occur solely in Quarter 4 of 2000 and 2008. Therefore, our yearly and quarterly figures display congruent trends & patterns with the figure depicting full-terms.
Question 2: I generate "dumb EPS forecasts" by calculating the rolling mean of actual EPS from the past 2 quarters. How do my EPS forecasts compare to Bloomberg forecasts?
As depicted in figure (features-dumb-eps), my average predicted EPS follows the average actual EPS trend more closely than the forecasters' does. This is a key takeaway: my method of using 2-quarter moving averages was much more effective at predicting average actual EPS values than the method used by Bloomberg forecasters.
My predictions are, however, slightly more unstable: variance among my predictions is much higher, as shown by the error bars per term.
My personal predictions "spiked" and "troughed" in the terms 2000Q4 and 2008Q4—the exact same pattern as depicted through average forecasted EPS.
Compared to the average forecasted EPS, my average predicted EPS is much more congruent with the trend of actual EPS, though only slightly more unstable. All of my predictions contain higher variance than actual EPS, which reduces my method's credibility when accounting for all individual data points.
Bloomberg's forecasted EPS displays less variance, and is slightly more stable and reliable than my method of using rolling 2-quarter, or half-year, averages.
Question 3: What differences and similarities emerge when analyzing the prediction error and percentage error of EPS forecasts?
For prediction errors, forecasters, on average, are likely to be more inaccurate in their predictions in Q4 of any given year. This is depicted in figure (features-perror-quarter-firm), where the top 5 most inaccurate firms by prediction error display relatively low average prediction errors until Q4, where both LRCX and AIG "branch off" in opposite directions.
For percentage errors, EPS forecasts have become more inaccurate in the more recent terms, starting from 2014Q3. Outliers only started to appear in the trend depicted in (pct-firm-term-strip) in the year 2014. This conclusion is consistent with the observations made in the quarterly and yearly stripplots: (pct-firm-quarter) and (pct-firm-quarter-top-strip) for quarterly data, and (pct-firm-year) and (pct-firm-year-top-strip) for yearly data.
The top 5 most inaccurate firms in terms of absolute prediction error are AGN, AIG, CHTR, LRCX, and VRSN. Respectively, these firm tickers stand for Allergan plc, American International Group Inc, Charter Communications Inc, Lam Research Corporation, and Verisign.
The top 5 most inaccurate firms in terms of absolute percentage error are IBM, IRM, MCK, PXD, and QRVO. Respectively, these firm tickers stand for IBM Common Stock, Iron Mountain Inc, McKesson Corporation, Pioneer Natural Resources, and Qorvo Inc.
For prediction error, the most notable outlier is AIG. This firm is, on average, the only outlier in Q4 of any given year.
For percentage error, the most notable outlier is IRM, with a percentage error of over -1000 for EPS in the term 2014Q3. Outliers only started appearing in the year 2014.
Question 4: [TK replace with new question]
Question 5: How do EOD Prices trend across all firms from 1999 - 2019?
EOD prices show a general positive trend from 2009, as can be seen with figure (features-eod-term).
EOD prices see a "trough" from 2007 - 2009: the exact years of the Great Recession.
All patterns shown in the Actual vs. Forecasted pointplots are consistent with the stripplot (features-feature-value) at the beginning of the bivariate stage. Actual EPS has the most outliers, and there are a few "spikes" in the pointplots where actual EPS veers significantly away from the forecasted EPS trend line.
#convert notebook to HTML
from subprocess import call
call(['jupyter', 'nbconvert', '--to', 'html', 'data_exploratory.ipynb'])